Automated Anomaly Detection

This invention relates to automated anomaly detection in data, and to a method, an
apparatus and computer software for implementing it. More particularly, although not
exclusively, it relates to detection of fraud in areas such as telecommunications and retail
sales, and to detection of software vulnerabilities by searching for anomalies in digital data.

It is known to detect data anomalies such as fraud or software vulnerabilities with the aid of 
management systems which use hand-crafted rules to characterise fraudulent behaviour. 
In the case of fraud, the rules are generated by human experts in fraud, who supply and 
update them for use in fraud management systems. The need for human experts to 
generate rules is undesirable because it is onerous, particularly if the number of possible
rules is large or changing at a significant rate. 

It is also known to avoid the need for human experts to generate rules: i.e. artificial neural
networks are known which learn to characterise fraud automatically by processing training
data. They use characteristics so learned to detect fraud in other data. However, neural
networks characterise fraud in a way that is not clear to a user and does not readily
translate into recognisable rules. It is important to be able to characterise fraud in terms of
breaking of acceptable rules, so this aspect of neural networks is a disadvantage.

Known rule-based fraud management systems can detect well-known types of fraud
because human experts know how to construct appropriate rules. In particular, fraud over
circuit-switching networks is well understood and can be dealt with in this way. However,
telecommunications technology has changed in recent years with circuit-switching
networks being replaced by Internet protocol packet-switching networks, which can
transmit voice and Internet protocol data over telecommunications systems. Fraud
associated with Internet protocol packet-switching networks is more complex than that
associated with circuit-switching networks: this is because in the Internet case, fraud can
manifest itself at a number of points on a network, and human experts are still learning
about the potential for new types of fraud. Characterising complex types of fraud manually
from huge volumes of data is a major task. As telecommunications traffic across
packet-switching networks increases, it becomes progressively more difficult to characterise
and detect fraud.

US Pat. No. 6,601,048 to Gavan discloses rule-based recognition of telephone fraud by a
thresholding technique: it establishes probabilities that certain occurrences will be
fraudulent most of the time (e.g. 80% of credit card telephone calls over 50 minutes in
length are fraudulent). It mentions that fraudulent behaviour is established from records but
not how it is done.

US Pat. No. 5,790,645 to Fawcett et al. also discloses rule-based recognition of telephone
fraud. It captures typical customer account behaviour (non-fraudulent activity) and employs
a standard rule learning program to determine rules distinguishing fraudulent activity from
non-fraudulent activity. Such a rule might be that 90% of night-time calls from a particular
city are fraudulent. Rules are used to construct templates each containing a rule field, a
training field monitoring some aspect of a customer account such as number of calls per
day, and a use field or functional response indicating fraudulent activity, e.g. number of
calls reaching a threshold. Templates are used in one or more profilers of different types
which assess customer account activity and indicate fraudulent behaviour: a profiler may
simply indicate a threshold has been reached by output of a binary 1, or it may give a count
of potentially fraudulent occurrences, or indicate the percentage of such occurrences in all
customer account activity. The approach of detecting deviation from correct behaviour is
more likely to yield false positives than detecting fraud directly, because it is difficult to
characterise all possible forms of normal behaviour.

US Pat. Appln. No. US 2002/0143577 to Shiffman et al. discloses rule-based detection of
compliant/valid or non-compliant/invalid responses by subjects in clinical trials. Quantitative
analysis is used to distinguish response types. This corresponds to rule generation by
human experts, which is time consuming. There is no disclosure of automatic rule
generation.



US Pat. Appln. No. US 2002/0147754 to Dempsey et al. discloses detection of
telecommunications account fraud or network intrusion by measuring difference between
two vectors.

There is also a requirement for automated detection of potentially exploitable vulnerabilities
in compiled software, i.e. binary code, by searching for code anomalies comprising
potentially incorrect code fragments. A malicious attacker may be able to force such
fragments to be executed in such a way as to cause a computer system running code
containing the fragments to behave insecurely.

Software vulnerabilities in computer source code are detectable using static analysis
techniques, also referred to as white-box testing techniques. However, source code is
frequently not available for analysis, in which case white-box techniques are not applicable.

It is also known to detect data anomalies in the form of vulnerabilities in compiled binary
code and disassembled binary code using hand-crafted rules to identify potential bugs in
the code. The rules are generated by human experts in vulnerability detection. For
example, in a hand-crafted rule set category, a "SmartRisk Analyzer" product of the @stake
company looks for "triggers" in a computer program written in assembly language code.
"Triggers" are calls to functions (such as strcpy) known to be vulnerable. On finding a
trigger, SmartRisk Analyzer traces a data and control path back through the program in
order to determine possible values of parameters comprising an argument of the vulnerable
or unsafe function, to see if the function call will be vulnerable during run time. So-called
black-box testing technologies, usually referred to as "fuzzers", are more commonly used;
fuzzers essentially perform a random search or a brute force search through a (usually
intractably large) space of test vectors. They can also be enhanced by hand crafting
constraints on the search space's domain.

As before, the need for human experts to generate rules is undesirable because it is
onerous. Although human experts may have much experience, it is not feasible for them to
learn from all possible scenarios. Gaining additional and wider experience takes time and
resources. Once a rule base is derived, it can be used to identify whether new software
applications contain potentially exploitable binary code. However, current systems of
vulnerability detection have rule bases which are typically static, i.e. unchanging over time
unless rules are added or edited manually. As new vulnerabilities become apparent, such
a system needs to be updated by hand in order to be able to identify associated 'bugs'.



A further deficiency of a rule-based approach, such as that used by @stake, is that it has a
limitation on the 'semantic depth' that is practical for such techniques. A vulnerability having
semantics which are sufficiently complex is not likely to be detected by such an approach.

United Kingdom Patent GB 2387681 discloses machine learning of rules for network
security. This disclosure concentrates on use of first-order logic to represent rules for
dealing with the problem of intrusion detection. It involves firstly attempting to characterise,
either pre-emptively or dynamically, behaviours on a given computer network that
correspond to potentially malicious activity; then, secondly, such characterisation provides
a means for preventing such activity or raising an alarm when such activity takes place.

Intrusion detection techniques, such as that proposed in GB 2387681, do not address the
problem of finding underlying vulnerabilities that might be used as part of an intrusion;
rather, they are concerned with characterising and monitoring network activity. Intrusion
detection systems use on-line network monitoring technology rather than a static off-line
assessment of code binaries. They therefore detect intrusion after it has happened, rather
than forestalling it by detecting potential code vulnerabilities to enable their removal prior to
exploitation by an intruder.

It is an object of the present invention to provide an alternative approach to anomaly 
detection. 



The present invention provides a method of anomaly detection characterised in that it
incorporates the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a training data
set and any available relevant background knowledge using at least first order logic, a
rule covering a proportion of positive anomaly examples of data in the training data set,
and

b) applying the rule set to test data for anomaly detection therein.

In an alternative aspect the present invention provides an automated method of anomaly
detection characterised in that it comprises using computer apparatus to execute the steps
of:-

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge using at least first order
logic, a rule covering a proportion of positive anomaly examples of data in the
training data set, and

b) applying the rule set to test data for anomaly detection therein.

The method of the invention provides the advantage that it obtains rules from data, not
human experts, it does so automatically, and the rules are not invisible to a user. At least
first order logic is used to generate the rule set, which allows variables in rules and general
relationships between them, and it is possible to include background knowledge. In the
sense used in this specification, an anomaly is a portion of data indicating some feature or
features which it is desired to locate or investigate, for example fraudulent behaviour or a
potentially incorrect fragment of computer program code indicating a software vulnerability.

Data samples in the training data set may have characters indicating whether or not they
are associated with anomalies. The invention may be a method of detecting
telecommunications or retail fraud or software vulnerabilities from anomalous data and may
employ inductive logic programming to develop the rule set.

Each rule may have a form that an anomaly is detected or otherwise by application of the 
rule according to whether or not a condition set of at least one condition associated with the 
rule is fulfilled. A rule may be developed by refining a most general rule by at least one of: 
a) addition of a new condition to the condition set; and 
b) unification of different variables to become constants or structured terms.

A variable in a rule which is defined as being in constant mode and is numerical is at least 
partly evaluated by providing a range of values for the variable, estimating an accuracy for 
each value and selecting a value having optimum accuracy. The range of values may be a 
first range with values which are relatively widely spaced, a single optimum accuracy value 
being obtained for the variable, and the method including selecting a second and relatively
narrowly spaced range of values in the optimum accuracy value's vicinity, estimating an
accuracy for each value in the second range and selecting a value in the second range
having optimum accuracy.
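
By way of illustration only, the two-stage evaluation of a numerical constant described above
might be sketched in Python as follows. The function names, the grid sizes and the accuracy
callback are assumptions introduced for this sketch; the specification does not prescribe how
the ranges are spaced or how accuracy is estimated.

```python
from typing import Callable, List, Sequence


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def best_value(values: Sequence[float],
               accuracy: Callable[[float], float]) -> float:
    """Return the candidate value with the highest estimated accuracy."""
    return max(values, key=accuracy)


def coarse_to_fine(lo: float, hi: float,
                   accuracy: Callable[[float], float],
                   n_coarse: int = 11, n_fine: int = 21) -> float:
    """Two-stage search for a numerical constant.

    Stage 1 scans a widely spaced range; stage 2 scans a narrowly spaced
    range in the vicinity of the stage 1 optimum (hypothetical grid sizes).
    """
    first = best_value(linspace(lo, hi, n_coarse), accuracy)
    spacing = (hi - lo) / (n_coarse - 1)
    # Second range: one coarse step either side, clipped to [lo, hi].
    fine_lo = max(lo, first - spacing)
    fine_hi = min(hi, first + spacing)
    return best_value(linspace(fine_lo, fine_hi, n_fine), accuracy)
```

In use, the accuracy callback would evaluate the rule on the training data with the constant
set to the supplied value; any scoring function with a single peak near the true optimum
serves to exercise the sketch.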

The method may include filtering to remove duplicates of rules and equivalents of rules, i.e.
rules having like but differently ordered conditions compared to another rule, and rules
which have conditions which are symmetric compared to those of another rule. It may
include filtering to remove unnecessary 'less than or equal to' ("lteq") conditions.
Unnecessary "lteq" conditions may be associated with at least one of ends of intervals,
multiple lteq predicates, and equality condition and lteq duplication.

The method may include implementing an encoding length restriction to avoid overfitting
noisy data by rejecting a rule refinement if the refinement encoding cost in number of bits 
exceeds a cost of encoding the positive examples covered by the refinement. 

Rule construction may stop if at least one of three stopping criteria is fulfilled as follows:

a) the number of conditions in any rule in a beam of rules being processed is greater
than or equal to a prearranged maximum rule length,

b) no negative examples are covered by a most significant rule, which is a rule that:
i) is present in a beam currently being or having been processed,
ii) is significant,
iii) has obtained a highest likelihood ratio statistic value found so far, and
iv) has obtained an accuracy value greater than a most general rule accuracy
value, and

c) no refinements were produced which were eligible to enter the beam currently being
processed in a most recent refinement processing step.
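
The three stopping criteria above might be checked as in the following minimal Python sketch.
The Rule class and its field names are hypothetical, and the sketch assumes that the most
significant rule has already been identified according to conditions i) to iv); how
significance and the likelihood ratio statistic are computed is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Rule:
    conditions: Tuple[str, ...]  # conditions in the IF part of the rule
    negatives_covered: int       # negative training examples the rule covers


def should_stop(beam: Sequence[Rule],
                most_significant: Optional[Rule],
                new_beam_entries: int,
                max_rule_length: int) -> bool:
    """Return True if at least one of the three stopping criteria holds."""
    # a) some rule in the beam has reached the maximum rule length
    if any(len(r.conditions) >= max_rule_length for r in beam):
        return True
    # b) the most significant rule (assumed pre-selected) covers no negatives
    if most_significant is not None and most_significant.negatives_covered == 0:
        return True
    # c) the most recent refinement step added nothing to the beam
    return new_beam_entries == 0
```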

A most significant rule may be added to a list of derived rules and positive examples 
covered by the most significant rule may be removed from the training data set.

The method may include: 

a) selecting rules which have not met rule construction stopping criteria, 

b) selecting a subset of refinements of the selected rules associated with accuracy 
estimate scores higher than those of other refinements of the selected rules, and

c) iterating a rule refinement, filtering and evaluation procedure to identify any refined rule 
usable to test data. 



In another aspect, the present invention provides computer apparatus for anomaly
detection characterised in that it is programmed to execute the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a training data
set and any available relevant background knowledge using at least first order logic, a
rule covering a proportion of positive anomaly examples of data in the training data set,
and

b) applying the rule set to test data for anomaly detection therein.

The computer apparatus may be programmed to develop the rule set using Higher-Order
logic. This may include developing the rule set by:

a) forming an alphabet having selector functions allowing properties of the training data
set to be extracted, together with at least one of the following: additional concepts,
background knowledge constant values and logical AND and OR functions,

b) forming current rules from combinations of items in the alphabet such that type
consistency and variable consistency is preserved,

c) evaluating the current rules for adequacy of classification of the training data set,

d) if no current rule adequately classifies the training data set, generating new rules by
applying at least one genetic operator to the current rules, a genetic operator having
one of the following functions: i) combining two rules to form a new rule, ii) modifying a
single rule by deleting one of its conditions or adding a new condition to it, or iii)
changing one of a rule's constant values for another of an appropriate type, and

e) designating the new rules as the current rules and iterating steps c) onwards until a
current rule adequately classifies the training data set or a predetermined number of
iterations is reached.
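
The three kinds of genetic operator named in step d) might be sketched in Python as follows.
This is an illustration under stated assumptions rather than the implementation of the
invention: a rule is represented here as a tuple of (selector, comparison, constant)
conditions read as a conjunction, the function names are hypothetical, and type and variable
consistency checks are reduced to matching the Python type of a constant.

```python
import random
from typing import Sequence, Tuple

Condition = Tuple[str, str, object]  # (selector, comparison, constant)
Rule = Tuple[Condition, ...]         # conditions joined by logical AND


def combine(r1: Rule, r2: Rule, rng: random.Random) -> Rule:
    """i) Combine two rules: a prefix of one joined to a suffix of the other.

    The result may be empty, which corresponds to a rule with no conditions.
    """
    return r1[:rng.randint(0, len(r1))] + r2[rng.randint(0, len(r2)):]


def delete_condition(rule: Rule, rng: random.Random) -> Rule:
    """ii) Delete one condition, keeping at least one in place."""
    if len(rule) <= 1:
        return rule
    i = rng.randrange(len(rule))
    return rule[:i] + rule[i + 1:]


def add_condition(rule: Rule, alphabet: Sequence[Condition],
                  rng: random.Random) -> Rule:
    """ii) Add a new condition drawn from the alphabet."""
    return rule + (rng.choice(alphabet),)


def change_constant(rule: Rule, constants: Sequence[object],
                    rng: random.Random) -> Rule:
    """iii) Swap one constant for another of the same Python type."""
    if not rule:
        return rule
    i = rng.randrange(len(rule))
    selector, comparison, old = rule[i]
    same_type = [c for c in constants if type(c) is type(old)]
    if not same_type:
        return rule
    return rule[:i] + ((selector, comparison, rng.choice(same_type)),) + rule[i + 1:]
```

Passing an explicit random.Random instance keeps each operator reproducible, which is
convenient when comparing successive generations of rules.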

Data samples in the training data set may have characters indicating whether or not they
are associated with anomalies. The at least one anomaly characterisation rule may have a
form that an anomaly is detected or otherwise by application of such rule according to
whether or not a condition set of at least one condition associated with that rule is fulfilled.
It may be developed by refining a most general rule by at least one of:
a) addition of a new condition to the condition set; and
b) unification of different variables to become constants or structured terms.

A variable in the at least one anomaly characterisation rule which is defined as being in 
constant mode and is numerical may be at least partly evaluated by providing a range of 
values for the variable, estimating an accuracy for each value and selecting a value having 
optimum accuracy.

The computer apparatus may be programmed to filter out at least one of rule duplicates,
rule equivalents and unnecessary 'less than or equal to' ("lteq") conditions. It may be
programmed to stop construction of a rule if at least one of three stopping criteria is fulfilled
as follows:

a) the number of conditions in any rule in a beam of rules being processed is greater than
or equal to a prearranged maximum rule length,

b) no negative examples are covered by a most significant rule, which is a rule that:
i) is present in a beam currently being or having been processed,
ii) is significant,
iii) has obtained a highest likelihood ratio statistic value found so far, and
iv) has obtained an accuracy value greater than a most general rule accuracy value,
and

c) no refinements were produced which were eligible to enter the beam currently being
processed in a most recent refinement processing step.



In a further aspect, the present invention provides computer software for use in anomaly
detection characterised in that it incorporates instructions for controlling computer
apparatus to execute the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a training data
set and any available relevant background knowledge using at least first order logic, a
rule covering a proportion of positive anomaly examples of data in the training data set,
and

b) applying the rule set to test data for anomaly detection therein.

The computer software may incorporate instructions for controlling computer apparatus to
develop the rule set using Higher-Order logic. It may incorporate instructions for controlling
computer apparatus to develop the rule set by:

a) forming an alphabet having selector functions allowing properties of the training data
set to be extracted, together with at least one of the following: additional concepts,
background knowledge constant values and logical AND and OR functions,

b) forming current rules from combinations of items in the alphabet such that type
consistency and variable consistency is preserved,

c) evaluating the current rules for adequacy of classification of the training data set,

d) if no current rule adequately classifies the training data set, generating new rules by
applying at least one genetic operator to the current rules, a genetic operator having
one of the following functions: i) combining two rules to form a new rule, ii) modifying a
single rule by deleting one of its conditions or adding a new condition to it, or iii)
changing one of a rule's constant values for another of an appropriate type, and

e) designating the new rules as the current rules and iterating steps c) onwards until a
current rule adequately classifies the training data set or a predetermined number of
iterations is reached.

Data samples in the training data set may have characters indicating whether or not they 
are associated with anomalies. 

The at least one anomaly characterisation rule may have a form that an anomaly is
detected or otherwise by application of such rule according to whether or not a condition
set of at least one condition associated with that rule is fulfilled.

The computer software may incorporate instructions for controlling computer apparatus to
develop the at least one anomaly characterisation rule by refining a most general rule by at
least one of:

a) addition of a new condition to the condition set; and

b) unification of different variables to become constants or structured terms.

The computer software may incorporate instructions for controlling computer apparatus to
at least partly evaluate a variable in the at least one anomaly characterisation rule which is
defined as being in constant mode and is numerical by providing a range of values for the
variable, estimating an accuracy for each value and selecting a value having optimum
accuracy. It may incorporate instructions for controlling computer apparatus to filter out at
least one of rule duplicates, rule equivalents and unnecessary 'less than or equal to' ("lteq")
conditions. It may also incorporate instructions for controlling computer apparatus to stop
construction of a rule if at least one of three stopping criteria is fulfilled as follows:

a) the number of conditions in any rule in a beam of rules being processed is greater than
or equal to a prearranged maximum rule length,

b) no negative examples are covered by a most significant rule, which is a rule that:
i) is present in a beam currently being or having been processed,
ii) is significant,
iii) has obtained a highest likelihood ratio statistic value found so far, and
iv) has obtained an accuracy value greater than a most general rule accuracy value,
and

c) no refinements were produced which were eligible to enter the beam currently being
processed in a most recent refinement processing step.

In order that the invention might be more fully understood, an embodiment thereof will now
be described, by way of example only, with reference to the accompanying drawings, in
which:-

Figure 1 illustrates use of a computer to monitor supermarket cashiers' tills in
accordance with the invention;

Figure 2 is a flow diagram illustrating an automated procedure implemented by the
Figure 1 computer for characterisation of fraudulent transactions in accordance
with the invention;

Figure 3 is another flow diagram illustrating generation of a rule set in the Figure 2
procedure for use in characterisation of fraudulent transactions; and

Figure 4 is a further flow diagram illustrating generation of a rule set using Higher Order
Logic.

One example of an application of automated anomaly detection using the invention
concerns characterisation of retail fraud committed in shops by cashiers. The invention in
this example may be used in conjunction with current commercial systems that can 
measure and record the amount of money put into and taken out of cashiers' tills. Various 
kinds of cashier behaviour may indicate fraudulent or suspicious activity. 

In this example of the invention transactions from a number of different cashiers' tills were
employed. Each transaction was described by a number of attributes including cashier
identity, date and time of transaction, transaction type (e.g. cash or non-cash) and an
expected and an actual amount of cash in a till before and after a transaction. Each
transaction is labelled with a single Boolean attribute which indicates "true" if the
transaction is known or suspected to be fraudulent and "false" otherwise. Without access
to retail fraud experts, definitions of background knowledge were generated in the form of
concepts or functions relating to data attributes. One such function calculated a number of
transactions handled by a specified cashier and having a discrepancy: here a discrepancy
is a difference in value between actual and expected amounts of cash in the till before and
after a single transaction.
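
A background knowledge function of the kind just described might look like the following
Python sketch. The Transaction fields and function names are hypothetical, amounts are assumed
to be held as whole numbers of the smallest currency unit, and a discrepancy is taken here to
mean that the actual and expected amounts differ either before or after the transaction; the
specification does not fix these details.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Transaction:
    cashier_id: str
    expected_before: int  # expected cash in till before the transaction
    actual_before: int    # actual cash in till before the transaction
    expected_after: int   # expected cash in till after the transaction
    actual_after: int     # actual cash in till after the transaction
    fraudulent: bool      # Boolean label: known or suspected fraud


def has_discrepancy(t: Transaction, tolerance: int = 0) -> bool:
    """True if actual and expected cash differ before or after (assumed reading)."""
    return (abs(t.actual_before - t.expected_before) > tolerance
            or abs(t.actual_after - t.expected_after) > tolerance)


def discrepancy_count(transactions: Iterable[Transaction],
                      cashier_id: str, tolerance: int = 0) -> int:
    """Number of transactions handled by the cashier that show a discrepancy."""
    return sum(1 for t in transactions
               if t.cashier_id == cashier_id and has_discrepancy(t, tolerance))
```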

In this example, the process of the invention derives rules from a training data set and the
definitions of basic concepts or functions associated with data attributes previously
mentioned. It evaluates the rules using a test data set and prunes them if necessary. The
rules so derived may be sent to an expert for verification or loaded directly into a fraud
management system for use in fraud detection. To detect fraud, the fraud management
system reads data defining new events and transactions to determine whether they are
described by the derived rules or not. When an event or transaction is described by a rule
then an alert may be given or a report produced to explain why the event was flagged up
as potentially fraudulent. The fraud management system will be specific to a fraud
application.

Benefits of applying the invention to characterisation of telecommunications and retail fraud
comprise:

• Characterisations in the form of rule sets may be learnt automatically (rather
than manually as in the prior art) from training data and any available
background knowledge or rules contributed by experts - this reduces costs and
duration of the characterisation process;

• Rule sets which are generated by this process are human readable and are
readily assessable by human experts prior to deployment within a fraud
management system; and

• The process may employ relational data, which is common in particular
applications of the invention - consequently facts and transactions which are in
different locations and which are associated can be linked together.

This example of the invention employs inductive logic programming software implemented
in a logic programming language called Prolog. It has an objective of creating a set of rules
that characterises a particular concept, the set often being called a concept description. A
target concept description in this example is a characterisation of fraudulent behaviour to
enable prediction of whether an event or transaction is fraudulent or not. The set of rules
should be applicable to a new, previously unseen and unlabelled transaction and be
capable of indicating accurately whether it is fraudulent or not.

A concept is described by data which in this example is a database of events or
transactions that have individual labels indicating whether they are fraudulent or
non-fraudulent. A label is a Boolean value, 1 or 0, indicating whether a particular event or
transaction is fraudulent (1) or not (0). Labelling transactions as fraudulent identifies
fraudulent cashiers, which are then referred to as positive examples of the target
concept; labelling transactions as non-fraudulent identifies non-fraudulent cashiers, which
are referred to as negative examples of the target concept.

In addition to receiving labelled event/transactional data, the inductive logic programming
software may receive input of further information, i.e. concepts, facts of interest or functions
that can be used to calculate values of interest, e.g. facts about customers and their
accounts and a function that can be used to calculate an average monthly bill of a given
customer. As previously mentioned, this further information is known as background
knowledge, and is normally obtained from an expert in the relevant type of fraud.

As a precursor to generating a rule set, before learning takes place, the labelled
event/transaction and cashier data is randomly distributed into two non-overlapping
subsets - a training data set and a test data set. Here non-overlapping means no data
item is common to both subsets. A characterisation or set of rules is generated using the
training data set. The set of rules is then evaluated on the test data set by comparing the
actual fraudulent or otherwise label associated with a cashier with the equivalent predicted
for it by the inductive logic programming software. This gives a value for prediction (or
classification) accuracy - the percentage of correctly assessed cashiers in the test data
set. Testing on a different data set of hitherto unseen examples, i.e. a set other than the
training data set, is a good indicator of the validity of the rule set.
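
The random division into non-overlapping subsets and the accuracy calculation might be
sketched in Python as follows. The function names, the default test fraction and the
representation of a rule as a Python predicate are assumptions made for illustration; an
example is predicted to be fraudulent here if any rule in the rule set fires on it.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split(examples: Sequence, test_fraction: float = 0.3,
          seed: int = 0) -> Tuple[List, List]:
    """Randomly divide examples into non-overlapping training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(rule_set: Sequence[Callable[[object], bool]],
             test_set: Sequence[Tuple[object, bool]]) -> float:
    """Percentage of test examples whose predicted label matches the actual one."""
    correct = sum(1 for example, label in test_set
                  if any(rule(example) for rule in rule_set) == label)
    return 100.0 * correct / len(test_set)
```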

The target concept description is a set of rules in which each rule covers or characterises a
proportion of the positive (fraudulent) examples of data but none of the negative
(non-fraudulent) examples. It is obtained by repeatedly generating individual rules. When a
rule is generated, positive examples which it covers are removed from the training data set.
The process then iterates by generating successive rules using unremoved positive
examples, i.e. those still remaining in the training data set. After each iteration, positive
examples covered by the rule most recently generated are removed. The process
continues until there are too few positive examples remaining to allow another rule to be
generated. This is known as the sequential covering approach, and is published in
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
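
The outer loop of the sequential covering approach might be sketched in Python as follows.
The learn_one_rule callback stands in for the rule search described later and is hypothetical,
as are the parameter names; a rule is again represented as a predicate over examples.

```python
from typing import Callable, List, Optional, Sequence

RulePredicate = Callable[[object], bool]


def sequential_covering(
    positives: Sequence,
    negatives: Sequence,
    learn_one_rule: Callable[[Sequence, Sequence], Optional[RulePredicate]],
    min_positives: int = 1,
) -> List[RulePredicate]:
    """Generate rules one at a time, removing covered positives after each."""
    rules: List[RulePredicate] = []
    remaining = list(positives)
    # Continue while enough positive examples remain to support another rule.
    while len(remaining) >= min_positives:
        rule = learn_one_rule(remaining, negatives)
        if rule is None:
            break
        if not any(rule(p) for p in remaining):
            break  # guard: a rule covering no positives would loop forever
        rules.append(rule)
        remaining = [p for p in remaining if not rule(p)]
    return rules
```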

Referring to Figure 1, an example of the invention involves use of a computer 1 to monitor
cashiers' tills 3 in a supermarket (not shown). The computer 1 has an associated visual
display unit 5 and printer 7. Referring now also to Figure 2, the computer 1 (not shown in
Figure 2) implements a process 10 involving running inductive logic programming software
(referred to as an ILP engine) at 12 to characterise fraudulent transactions; such
transactions are indicated by data which the computer 1 detects is anomalous. The
process 10 inputs background knowledge 14 and a training data set 16 to the computer 1
for processing at 12 by the ILP engine: this produces a set of rules 18. Rule set
performance is evaluated at 20 using a test data set 22.

Processing 12 to generate a set of rules is shown in more detail in Figure 3. Individual
rules have a form as follows:

IF {set of conditions} THEN {behaviour is fraudulent} (1)

A computer search for each individual rule begins at 30 with a most general rule (a rule
with no conditions): searching is iterative (as will be described later) and generates a
succession of rules, each new rule search beginning at 30. The most general rule is:

IF { } THEN target_predicate is true (2)

This most general rule is satisfied by all examples, both positive and negative, because it
means that all transactions and facts are fraudulent. It undergoes a process of refinement
to make it more useful. There are two ways of producing a refinement to a rule as follows:

• addition of a new condition to the IF{ } part of the rule;

• unification of different variables to become constants or structured terms.

Addition of a new condition and unification of different variables are standard expressions
for refinement operator types though their implementation may differ between systems. A
condition typically corresponds to a test on some quantity of interest, and tests are often
implemented using corresponding functions in the background knowledge. When a new
condition is added to a rule, its variables are unified with those in the rest of the rule
according to user-specified mode declarations. Unification of a variable X to a variable Y
means that all occurrences of X in the rule will be replaced by Y. A mode declaration for a
predicate specifies the type of each variable and its mode. A variable mode may be input,
output, or a constant. Only variables of the same type can be unified. Abiding by mode
rules reduces the number of refinements that may be derived from a single rule and thus
reduces the space of possible concept descriptions and speeds up the learning process.
There may be more than one way of unifying a number of variables in a rule, in which case
there will be more than one refinement of the rule.

For example, a variable X may refer to a list of items. X could be unified to a constant
value [ ] which represents an empty list or to [Y|Z] which represents a non-empty list with a
first element consisting of a variable Y and having another variable Z representing the rest
of the list. Instantiating X by such unification constrains its value. In the first case, X is a
list with no elements and in the second case it must be a non-empty list. Unification acts to
refine variables and rules that contain them.

Variables that are defined as being in constant mode must be instantiated by a constant 
value. Variables of constant type can further be defined by the user as either non- 
numerical or numerical constants. 

If a constant is defined as non-numerical then a list of possible discrete values for the
constant must also be specified by a user in advance. For each possible value of the
constant, a new version of an associated refinement is created in which the value is
substituted in place of the corresponding variable. New refinements are evaluated using
an appropriate accuracy estimate and the refinement giving the best accuracy score is
recorded as the refinement of the original rule.

If a constant is specified as numerical, it can be further defined as either an integer or a
floating-point number. A method for calculating a best constant in accordance with the
invention applies to both integers and floating-point numbers. If a constant is defined as
numerical then a continuous range of possible constant values must be specified by a user
in advance. For example, if the condition was "minutes_past_the_hour(X)" then X could
have a range 0-59.

In an integer constant search, if a range or interval length for a particular constant is less
than 50 in length, all integers (points) in the range are considered. For each of these
integers, a new version of a respective associated refinement is created in which the
relevant integer is substituted in place of a corresponding variable and new rules are
evaluated and given an accuracy score using an appropriate accuracy estimation
procedure. The constant(s) giving a best accuracy score is (are) recorded.

If the integer interval length is greater than 50, then the computer 1 carries out a recursive
process as follows:

1. A proportion of the points (which are evenly spaced) in the interval length are sampled
to derive an initial set of constant values. For example, in the
"minutes_past_the_hour(X)" example, 10, 20, 30, 40 and 50 minutes might be sampled.
For each of these values, a new version of a respective refinement is created in which
the value is substituted in place of a corresponding variable and a respective rule is
evaluated for each value together with an associated accuracy estimate.

2a. If a single constant value provides the best score then a number of the values (the
number of which is a user-selected parameter in the ILP engine 12) either side of this
value are sampled. For instance, if the condition minutes_past_the_hour(20) gave the
best accuracy then the following more precise conditions may then be evaluated:

• minutes_past_the_hour(15)
• minutes_past_the_hour(16)
• minutes_past_the_hour(17)
• minutes_past_the_hour(18)
• minutes_past_the_hour(19)
• minutes_past_the_hour(21)
• minutes_past_the_hour(22)
• minutes_past_the_hour(23)
• minutes_past_the_hour(24)
• minutes_past_the_hour(25)

If a single constant value in X = 15 to 25 gives the best accuracy score then that value is
chosen as a final value of the constant X.



2b. If more than one constant value provides the best score then, if they are consecutive
points in the sampling, the highest and lowest values are taken and the values in
their surrounding intervals are tested. For example, if minutes_past_the_hour(20),
minutes_past_the_hour(30) and minutes_past_the_hour(40) all returned the same
accuracy then the following points would be tested for accuracy:

• minutes_past_the_hour(15)
• minutes_past_the_hour(16)
• minutes_past_the_hour(17)
• minutes_past_the_hour(18)
• minutes_past_the_hour(19)
• minutes_past_the_hour(41)
• minutes_past_the_hour(42)
• minutes_past_the_hour(43)
• minutes_past_the_hour(44)
• minutes_past_the_hour(45)

If the accuracy score decreases at an integer value N in the range 15 to 19 or 41 to 45,
then (N-1) is taken as the constant in the refinement of the relevant rule.

2c. If a plurality of constant values provides the best accuracy score, and the values are
not consecutive sampled points, then they are arranged into respective subsets of
consecutive points. The largest of these subsets is selected, and the procedure for a
list of consecutive points is followed as at 2b above: e.g. if minutes_past_the_hour(20),
minutes_past_the_hour(30) and minutes_past_the_hour(50) scored best then the
subset minutes_past_the_hour(20) - minutes_past_the_hour(30) would be chosen. If
the largest subset consists of only one value, then the procedure for a single returned
value is followed as at 2a above.
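
The coarse-to-fine integer search of steps 1 to 2c can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the ILP engine's own code: the function names, the `score` callback (standing in for the accuracy estimate), the sampling `step` of 10 and the neighbourhood size `k` of 5 are all hypothetical, and returning the lowest and highest values reached before the score drops is only one reading of step 2b.

```python
def refine_single(score, value, lo, hi, k):
    """Step 2a: test k integers either side of the single best sample."""
    candidates = range(max(lo, value - k), min(hi, value + k) + 1)
    best = max(candidates, key=score)
    return best, best


def refine_plateau(score, low, high, lo, hi, k):
    """Step 2b: walk outwards from a run of equally good samples and stop
    just before the first integer at which the score decreases."""
    plateau = score(low)
    upper = high
    for n in range(high + 1, min(hi, high + k) + 1):
        if score(n) < plateau:
            break
        upper = n
    lower = low
    for n in range(low - 1, max(lo, low - k) - 1, -1):
        if score(n) < plateau:
            break
        lower = n
    return lower, upper


def search_integer_constant(score, lo, hi, step=10, k=5, small=50):
    """Return (lowest, highest) constant found for the range lo..hi."""
    if hi - lo < small:
        # Short interval: every integer is evaluated.
        best = max(range(lo, hi + 1), key=score)
        return best, best
    # Step 1: evenly spaced sample, e.g. 10, 20, 30, 40, 50 for 0..59.
    samples = list(range(lo + step, hi + 1, step))
    scores = [score(s) for s in samples]
    top = max(scores)
    winners = [i for i, s in enumerate(scores) if s == top]
    # Step 2c: group winning samples into runs of consecutive sample points.
    runs = [[winners[0]]]
    for i in winners[1:]:
        if i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    run = max(runs, key=len)  # largest run; the first one wins ties
    if len(run) == 1:
        return refine_single(score, samples[run[0]], lo, hi, k)
    return refine_plateau(score, samples[run[0]], samples[run[-1]], lo, hi, k)
```

With a toy score such as `lambda x: -abs(x - 17)` over the range 0 to 59, the sample at 20 wins the first pass and the follow-up search over 15 to 25 settles on 17.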



The user can opt to conduct a beam constant search: here a beam is an expression
describing the generation of a number of possible refinements to a rule and the recording
of all of them, to enable a choice to be made between them later when subsequent
refinements have been generated. In this example, N refinements of a rule, each with a
different constant value, are recorded. This can be very effective, as the 'best' constant
with highest accuracy at one point in the refinement process 32 may not turn out to be the
'best' value over a series of repeated refinement iterations. This avoids the process 32
getting fixed in local non-optimum maxima.

Some variables in conditions/rules may be associated with multiple constants: if so each
constant associated with such a variable is treated as an individual constant, and a
respective best value for each is found separately as described above. An individual
constant value that obtains a highest accuracy score for the relevant rule is kept and the
corresponding variable is instantiated to that value. The remaining variables of constant
type are instantiated by following this process recursively until all constant type variables
have been instantiated (i.e. substituted by values).

Once all refinements of a rule have been found, in accordance with the invention, the
computer 1 filters refinements at 34 to remove any rules that are duplicates or equivalents
of others in the set. Two rules are equivalent in that they express the same concept if their
conditions in the IF {set of conditions} part of the rule are the same but the conditions are
ordered differently. For example, IF {set of conditions} consisting of two conditions A and B
is equivalent to IF {set of conditions} with the same two conditions in a different order, i.e. B
and A. One of the two equivalent rules is removed from the list of refinements and so is not
considered further during rule refinement, which reduces the processing burden.

Additionally, in accordance with the invention, symmetric conditions are not allowed in any
rule. For example, a condition equal(X,2), meaning that a variable X is equal in value to 2, is
symmetric to equal(2,X), i.e. 2 is equal in value to a variable X. One of the two symmetric
rules is removed from the list of refinements and so is not considered further.
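
Both filters (equivalent rules and symmetric conditions) can be illustrated with one short Python sketch. It assumes, purely for illustration, that a rule is a list of condition tuples such as `("equal", "X", 2)` and that `equal` is the only symmetric predicate; the function names are hypothetical.

```python
SYMMETRIC_PREDICATES = {"equal"}  # assumed set of symmetric predicates


def canonical_condition(condition):
    """Order the arguments of a symmetric condition in one fixed way."""
    name, *args = condition
    if name in SYMMETRIC_PREDICATES:
        args = sorted(args, key=repr)  # repr gives a total order over mixed types
    return (name, *args)


def remove_equivalent_rules(rules):
    """Keep the first rule of each group whose condition sets coincide."""
    seen, kept = set(), []
    for conditions in rules:
        # A frozenset ignores the order in which conditions are written.
        key = frozenset(canonical_condition(c) for c in conditions)
        if key not in seen:
            seen.add(key)
            kept.append(conditions)
    return kept
```

Comparing rules by an order-independent key means that a rule with conditions A and B and a rule with conditions B and A collapse to a single entry, as do rules that differ only in the argument order of a symmetric condition.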

Pruning refinements to remove equivalent rules and symmetric conditions results in fewer
rules for the computer to consider at successive iterations of the refinement process 32, so
the whole automated rule generation process is speeded up. Such pruning can reduce rule
search space considerably, albeit the extent of this reduction depends on what application
is envisaged for the invention and how many possible conditions are symmetric: in this
connection, where numerical variables are involved, symmetric conditions are usually
numerous due to the use of 'equals' conditions such as equal(Y,X). For example, in the
retail fraud example, the rule search space can be cut by up to a third.

A 'less than or equals' condition, referred to as 'lteq', and an 'equals' condition are often
used as part of the background knowledge 14. They are very useful conditions for
comparing numerical variables within the data. For this reason, part of the filtering process
34 ascertains that equals and lteq conditions in rules meet checking requirements as
follows:

• End of interval check: the computer checks the end of intervals where constant
values are involved: e.g. a condition lteq(A, 1000) means variable A is less than or
equal to 1000: it is unnecessary if A has a user-defined range of between 0 and
1000, so a refinement containing this condition is removed. In addition, lteq(1000,
A), i.e. 1000 is less than or equal to A, should be equals(A, 1000) as A cannot be more
than 1000. Therefore, refinements containing such conditions are rejected.

• Multiple 'lteq' predicate check: if two conditions lteq(A,X) and lteq(B,X), where A
and B are constants, are contained in the body of a rule, then one condition may be
removed depending on the values of A and B. For example, if lteq(30,X) and
lteq(40,X) both appear in a rule, then the computer removes the condition lteq(30,X)
from the rule as being redundant, because if 40 is less than or equal to X then so
also is 30.

• Equals and lteq duplication check: in accordance with the invention, if the body of a
rule contains both conditions lteq(C, Constant) and equals(C, Constant), then only
the equals condition is needed. Therefore, refinements containing lteq conditions
with associated equals conditions of this nature are rejected by the computer.
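
The three checks above can be sketched in Python as below. The representation is an assumption made for illustration only: each condition is a three-element tuple, variables are strings, constants are numbers, and `upper_bounds` maps a variable to the top of its user-defined range. The function names are hypothetical.

```python
def is_constant(term):
    return isinstance(term, (int, float))


def filter_refinement(conditions, upper_bounds):
    """Return the simplified condition list, or None if the refinement is rejected."""
    conditions = list(conditions)
    for name, left, right in conditions:
        if name != "lteq":
            continue
        # End of interval check: lteq(A, max) is always true and
        # lteq(max, A) can only hold when A equals max.
        if not is_constant(left) and is_constant(right):
            if right >= upper_bounds.get(left, float("inf")):
                return None
        if is_constant(left) and not is_constant(right):
            if left >= upper_bounds.get(right, float("inf")):
                return None
        # Equals and lteq duplication check.
        if ("equals", left, right) in conditions:
            return None
    # Multiple lteq check: keep only the largest constant lower bound per variable.
    tightest = {}
    for name, left, right in conditions:
        if name == "lteq" and is_constant(left) and not is_constant(right):
            tightest[right] = max(left, tightest.get(right, left))
    return [
        (name, left, right)
        for name, left, right in conditions
        if not (name == "lteq" and is_constant(left) and not is_constant(right)
                and left < tightest[right])
    ]
```

For instance, with `upper_bounds = {"X": 100}` the pair lteq(30,X) and lteq(40,X) is reduced to lteq(40,X) alone, while a rule holding both lteq(X,50) and equals(X,50) is rejected.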

Rule refinements are also filtered at 34 by the computer using a method called 'Encoding
Length Restriction' disclosed by N. Lavrac and S. Dzeroski, Inductive Logic Programming:
Techniques and Applications. Ellis Horwood, New York, 1994. It is based on a 'Minimum
Description Length' principle disclosed by B. Pfahringer, Practical Uses of the Minimum
Description Length Principle in Inductive Learning, PhD Thesis, Technical University of
Vienna, 1995.

Where training examples are noisy (i.e. contain incorrect or missing values), it is desirable
to ensure that rules generated using the invention do not overfit data by treating noise
present in the data as requiring fitting. Rule sets that overfit training data may include
some very specific rules that only cover a few training data samples. In noisy domains, it is
likely that these few samples will be noisy: noisy data samples are unlikely to indicate
transactions which are truly representative of fraud, and so rules should not be derived to
cover them.

The Encoding Length Restriction avoids overfitting noisy data by generating a rule
refinement only if the cost of encoding the refinement does not exceed the cost of encoding
the positive examples covered by the refinement: here 'cost' means number of bits. A
refinement is rejected by the computer if this cost criterion is not met. This prevents rules
becoming too specific, i.e. covering few but potentially noisy transactions.



Once a rule is refined, the resulting refinements are evaluated in order to identify those
which are best. The computer evaluates rules at 36 by estimating their classification
accuracy. This accuracy may be estimated using an expected classification accuracy
estimate technique disclosed by N. Lavrac and S. Dzeroski, Inductive Logic Programming:
Techniques and Applications. Ellis Horwood, New York, 1994, and by F. Zelezny and N.
Lavrac, An Analysis of Heuristic Rule Evaluation Measures, J. Stefan Institute Technical
Report, March 1999. Alternatively, it may be estimated using a weighted relative accuracy
estimate disclosed by N. Lavrac, P. Flach and B. Zupan, Rule Evaluation Measures: A
Unifying View, Proceedings of the 9th International Workshop on Inductive Logic
Programming (ILP-99), volume 1634 of Lecture Notes in Artificial Intelligence, pages 174-
185, Springer-Verlag, June 1999. A user may decide which estimating technique is used
to guide a rule search through a hypothesis space during rule generation.
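
As an illustration of the second option, weighted relative accuracy is commonly written as p(Cond) · (p(Class | Cond) − p(Class)). The Python sketch below computes it from coverage counts; the function and parameter names are assumptions introduced here, not taken from the cited papers or from the ILP engine.

```python
def weighted_relative_accuracy(p, n, total_pos, total_neg):
    """p, n: positive and negative examples covered by the rule;
    total_pos, total_neg: all positive and negative training examples."""
    covered = p + n
    total = total_pos + total_neg
    if covered == 0 or total == 0:
        return 0.0
    coverage = covered / total      # p(Cond)
    precision = p / covered         # p(Class | Cond)
    prior = total_pos / total       # p(Class)
    return coverage * (precision - prior)
```

A rule that covers every example scores zero on this measure, which makes it a natural baseline against which the most general rule can be compared.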



Once refinements have been evaluated in terms of accuracy, they are then tested by the
computer for what is referred to in the art of rule generation as 'significance'. In this
example a significance testing method is used which is based on a likelihood ratio statistic
disclosed in the N. Lavrac and S. Dzeroski reference above. A rule is defined as
'significant' if its likelihood ratio statistic value is greater than a predefined threshold set by
the user.

If a rule covers n positive examples and m negative examples, an optimum outcome of
refining the rule is that one of its refinements (an optimum refinement) will cover n positive
examples and no negative examples. A likelihood ratio for this optimum refinement can be
calculated by the computer. A rule is defined as 'possibly significant' if its optimum
refinement is significant. Arising from this definition, it is possible that a rule may not
actually be significant but it may be possibly significant.
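
The distinction between 'significant' and 'possibly significant' can be sketched in Python as follows. The statistic used here is the form familiar from CN2-style rule learners, 2 · Σ observed · ln(observed / expected); it is an assumption that this matches the exact formula in the cited reference, and all function names are hypothetical.

```python
import math


def likelihood_ratio(p, n, total_pos, total_neg):
    """2 * sum(observed * ln(observed / expected)) over the two classes."""
    covered = p + n
    total = total_pos + total_neg
    statistic = 0.0
    for observed, class_total in ((p, total_pos), (n, total_neg)):
        if observed > 0:
            # Expected count if covered examples followed the prior class split.
            expected = covered * class_total / total
            statistic += observed * math.log(observed / expected)
    return 2.0 * statistic


def is_significant(p, n, total_pos, total_neg, threshold):
    return likelihood_ratio(p, n, total_pos, total_neg) > threshold


def is_possibly_significant(p, n, total_pos, total_neg, threshold):
    # The optimum refinement keeps all p positives and covers no negatives.
    return is_significant(p, 0, total_pos, total_neg, threshold)
```

With 20 positive and 80 negative training examples and a threshold of 20, a rule covering 10 positives and 10 negatives is not significant under this sketch, yet it is possibly significant because a refinement covering the same 10 positives and no negatives would pass the threshold.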

The computer checks a rule under consideration in the process 12 at 38 to see whether or
not it meets rule construction stopping criteria: in this connection, the construction of an
individual rule terminates when the computer determines that any one or more of three
stopping criteria is fulfilled as follows:

1. the number of conditions in any rule in a beam (as defined earlier) currently being
processed is greater than or equal to a maximum rule length specified by the user.
If a most significant rule (see at 2. below) exists this is added to the accumulating
rule set at 40,

2. a most significant rule covers no negative examples, where the most significant
rule is defined as a rule that is either present in the current beam, or was present in
a previous beam, and this rule:

a) is significant,

b) obtained the highest likelihood ratio statistic value found so far, and

c) obtained an accuracy value greater than the accuracy value of the most
general rule (that covers all examples, both positive and negative), and

3. the previous refinement step 32 produced no refinements eligible to enter the new
beam; if a most significant rule exists it is added to the accumulating rule set at 40.

Note that a most significant rule may not necessarily exist: if so, no significant refinements
have been found so far. If it is the case that a most significant rule does not exist but the
stopping criteria at 38 are satisfied, then no rule is added to the rule set at 40 by the
computer and the stopping criteria at 44 are satisfied (as will be described later).

When a rule is added at 40, the positive examples it covers are removed from the training
data by the computer 1 at 42, and remaining or unremoved positive and negative examples
form a modified training data set for a subsequent iteration (if any) of the rule search.

At 44 the computer 1 checks to see whether or not the accumulating rule set satisfies
stopping criteria. In this connection, accumulation of the rule set terminates at 46 (finalising
the rule set) when either of the following criteria is fulfilled, that is to say when either:

• construction of a rule is terminated because a most significant rule does not exist,
or

• too few positive examples remain for further rules to be significant.

If at 44 the accumulating rule set does not satisfy the rule set stopping criteria, the
computer 1 selects another most general rule at 30 and accumulation of the rule set
iterates through stages 32 etc. At any given time in operation of the rule generation
process 12, there are a number (zero or more) of rules for which computer processing has
terminated and which have been added to the accumulating rule set, and there are (one or
more) evolving rules or proto-rules for which processing to yield refinements continues
iteratively.
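
The outer loop described above (add a rule at 40, remove the positives it covers at 42, test the stopping criteria at 44) can be sketched in Python as follows. The callbacks `find_rule` and `covers`, the `min_positives` parameter and the toy interval rules used to exercise the loop are all hypothetical stand-ins for the refinement search 30 to 38.

```python
def learn_rule_set(positives, negatives, find_rule, covers, min_positives=1):
    """Accumulate rules until no most significant rule is found or too few
    positive examples remain."""
    rule_set = []
    remaining = list(positives)
    while len(remaining) >= min_positives:
        rule = find_rule(remaining, negatives)   # stages 30 to 38
        if rule is None:                         # no most significant rule
            break
        rule_set.append(rule)                    # stage 40
        uncovered = [e for e in remaining if not covers(rule, e)]  # stage 42
        if len(uncovered) == len(remaining):     # guard against no progress
            break
        remaining = uncovered
    return rule_set


def toy_find_rule(pos, neg):
    """Toy rule search: an interval from the smallest remaining positive up
    to just below the next negative."""
    start = min(pos)
    larger_negatives = [x for x in neg if x > start]
    end = min(larger_negatives) - 1 if larger_negatives else max(pos)
    return (start, end)


def toy_covers(rule, example):
    return rule[0] <= example <= rule[1]
```

Running the loop on positives 12, 15 and 40 with negatives 1, 2 and 30 yields two interval rules: the first covers 12 and 15, and the second is learned from the single positive that remains.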

If evolving rules are checked at 38 and are found not to meet any of the rule construction
stopping criteria previously mentioned, those refinements of such rules are chosen which
have the best accuracy estimate scores. The chosen refinements then provide a basis for
a next generation of rules to be refined further in subsequent refinement iterations. The
user defines the number of refinements forming a new beam to be taken by the computer
to a further iteration by fixing a parameter called 'beam_width'. As has been said, a beam
is a number of recorded possible refinements to a rule from which a choice will be made
later, and beam_width is the number of refinements in it. For a beam width N, the
refinements having the best N accuracy estimate scores are found and taken forward at 48
as part of the new beam to the next iteration. The sequence of stages 32 to 38 then iterates
for this new beam via a loop 50.
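
Beam selection at 48 can be sketched in Python as below, combining the top-N choice with the two eligibility conditions that the description sets out next (possible significance, and accuracy no worse than the parent rule). The dictionary keys and the function name are assumptions made for this illustration.

```python
def select_new_beam(refinements, beam_width):
    """Keep the beam_width most accurate refinements that are eligible."""
    eligible = [
        r for r in refinements
        if r["possibly_significant"] and r["accuracy"] >= r["parent_accuracy"]
    ]
    # Highest accuracy first; ties keep their original order.
    eligible.sort(key=lambda r: r["accuracy"], reverse=True)
    return eligible[:beam_width]
```

A refinement with a high accuracy score is therefore still excluded from the new beam if it is not possibly significant or if it scores below its parent.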

Each refinement entering the new beam must:

• be possibly significant (but not necessarily significant), and 

• improve upon or equal the accuracy of its parent rule (the rule from which it was 
derived by refinement previously).

If required by the user, the accumulated rule set can be post-pruned by the computer using
a reduced error pruning method disclosed by J. Fürnkranz, "A Comparison of Pruning
Methods for Relational Concept Learning", Proceedings of AAAI-94 Workshop on
Knowledge Discovery in Databases (KDD-94), Seattle, WA, 1994. In this case, another set
of examples should be provided: a pruning set of examples.

Examples of a small training data set, background knowledge and a rule set generated
therefrom will now be given. In practice there may be very large numbers of data samples
in a data set.

Training data

The training data is a transaction database, represented as Prolog facts in a format as
follows:

trans(Trans ID, Date, Time, Cashier, Expected amount in till, Actual amount in till,
Suspicious Flag). Here 'trans' and 'Trans' mean transaction and ID means identity.
A sample of an example set of transaction data is shown below. Transactions with
Suspicious Flag = 1 are fraudulent (positive examples), and with Suspicious Flag = 0
are not (negative examples). The individual Prolog facts were:
trans(1,30/08/2003,09:02,cashier_1,121.87,123.96,0).
trans(2,30/08/2003,08:56,cashier_1,119.38,121.82,0).
trans(3,30/08/2003,08:50,cashier_1,118.59,119.38,0).
trans(4,30/08/2003,08:48,cashier_1,116.50,118.59,0).
trans(5,30/08/2003,08:44,cashier_1,115.71,116.50,0).
trans(6,30/08/2003,22:40,cashier_2,431.68,435.17,0).
trans(7,30/08/2003,22:37,cashier_2,423.70,431.68,1).
trans(8,30/08/2003,22:35,cashier_2,420.01,423.70,0).

These labelled transactions indicate that cashier_2 is suspected to have been fraudulent
because the Suspicious Flag in the seventh of the above lines is 1, while cashier_1 is not,
giving us the following Prolog facts or statements:

:- fraudulent(cashier_1).
fraudulent(cashier_2).

The first statement is specifying that cashier_1 is not a fraudulent cashier because it begins
with a minus sign. This is because the suspicious transaction flag is set to 0 for all of the
transactions associated with cashier_1. Cashier_2 however has the Suspicious Flag set to
1 for one of the transactions associated with it, and therefore the second statement is
specifying that cashier_2 is thought to be fraudulent. These provide positive and negative
examples for learning the concept of a fraudulent cashier.
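
How the per-transaction flags become per-cashier examples can be sketched in Python as follows. The tuple layout (transaction ID, cashier, Suspicious Flag) and the function name are assumptions chosen for brevity; the values are those of the sample facts above.

```python
# (trans_id, cashier, suspicious_flag) taken from the sample transaction facts
transactions = [
    (1, "cashier_1", 0), (2, "cashier_1", 0), (3, "cashier_1", 0),
    (4, "cashier_1", 0), (5, "cashier_1", 0), (6, "cashier_2", 0),
    (7, "cashier_2", 1), (8, "cashier_2", 0),
]


def label_cashiers(transactions):
    """A cashier is a positive example if any of its transactions is flagged."""
    cashiers = {cashier for _, cashier, _ in transactions}
    positives = {cashier for _, cashier, flag in transactions if flag == 1}
    return positives, cashiers - positives
```

Applied to the eight sample transactions, this gives cashier_2 as the only positive example and cashier_1 as the only negative example.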

Background knowledge: this includes tests that are thought to be appropriate by a
domain expert. Examples of appropriate background knowledge concepts, represented
using Prolog, are:

discrepancy(Trans_ID, Discrepancy).

This gives the discrepancy in UK £ and pence between the expected amount of cash
in a till and the actual amount of cash in that till for a particular transaction identity
(Trans_ID), e.g.:

discrepancy(1, 2.09).
discrepancy(2, 2.44).
discrepancy(7, 7.98).

total_trans(Cashier number, Total number of transactions, Month/Year).
This gives the total number of transactions made by the cashier in a given month of a
year, e.g.:

total_trans(cashier_1, 455, 08/2003).
total_trans(cashier_2, 345, 08/2003).

number_of_trans_with_discrepancy(Cashier, Number, Month/Year).
This gives the total number of transactions with a discrepancy made by a cashier in a
given month of a year, e.g.:

number_of_trans_with_discrepancy(cashier_1, 38, 08/2003).
number_of_trans_with_discrepancy(cashier_2, 93, 08/2003).

number_of_trans_with_discrepancy_greater_than(Cashier, Number, Bound, Month/Year).
This gives the total number of transactions with a discrepancy greater than some
bound made by a cashier in a given month of a year, e.g.:

number_of_trans_with_discrepancy_greater_than(…
number_of_trans_with_discrepancy_greater_than(…
number_of_trans_with_discrepancy_greater_than(cashier_2, 15, 100, 08/2003).
number_of_trans_with_discrepancy_greater_than(cashier_2, 2, 200, 08/2003).

Generated rule set:

The target concept is fraudulent(Cashier). The rule set characterises a cashier who has
made fraudulent transactions.

fraudulent(Cashier) :-
    number_of_trans_with_discrepancy_greater_than(Cashier, Discrepancies, 100, Month),
    Discrepancies ≥ 10.
fraudulent(Cashier) :-
    total_trans(Cashier, Total_Trans, Month),
    Total_Trans ≥ 455,
    number_of_trans_with_discrepancy(Cashier, Discrepancies, Month),
    Discrepancies ≥ 230.

This example of a generated rule set characterises fraudulent cashiers using two rules.
The first rule indicates that a cashier is fraudulent if, in a single month, the cashier has
performed at least 10 transactions with a discrepancy greater than 100.

The second rule describes a cashier as fraudulent if, in a single month, the cashier has
carried out at least 455 transactions, where at least 230 of these have had a discrepancy
between the expected amount and the actual transaction amount.
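
To make the two rules concrete, the Python sketch below applies them to the background facts that are legible in the example. The dictionary layout and function name are assumptions; a missing fact is treated as a failed condition (as a Prolog goal with no matching fact would fail), and the cashier_1 facts for discrepancies greater than a bound are omitted because they are not recoverable from the source.

```python
MONTH = "08/2003"

# Background facts from the example, keyed by their non-numeric arguments.
total_trans = {("cashier_1", MONTH): 455, ("cashier_2", MONTH): 345}
trans_with_discrepancy = {("cashier_1", MONTH): 38, ("cashier_2", MONTH): 93}
trans_with_discrepancy_greater_than = {
    ("cashier_2", 100, MONTH): 15,
    ("cashier_2", 200, MONTH): 2,
}


def fraudulent(cashier, month):
    # Rule 1: at least 10 transactions with a discrepancy greater than 100.
    rule_1 = trans_with_discrepancy_greater_than.get((cashier, 100, month), 0) >= 10
    # Rule 2: at least 455 transactions, at least 230 of them with a discrepancy.
    rule_2 = (total_trans.get((cashier, month), 0) >= 455
              and trans_with_discrepancy.get((cashier, month), 0) >= 230)
    return rule_1 or rule_2
```

On these facts cashier_2 is classified as fraudulent by the first rule (15 such transactions), while cashier_1 satisfies neither rule.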

The embodiment of the invention described above provides the following benefits:

• speed of operation, because it prunes out redundancy arising from duplicated rules
and avoids associated unnecessary processing;

• capability for dealing with and tuning numerical and non-numerical constants to
derive rules that bound variables (e.g. IF transaction value is between £19.45 and
£67.89 THEN ...);

• capability for making use of many different heuristics (decision techniques, e.g.
based on scores for accuracy), which can be changed and turned on or off by a
user;

• a weighted relative accuracy measure is used in rule generation;

• capability for developing rules that are readable and whose reasoning can be
understood (unlike a neural network for example);

• capability for being tuned to a particular application by adjusting its parameters and
changing/adding heuristics;

• capability for using relational and structural data that can be expressed in Prolog;

• capability for processing numerical and non-numerical data; and

• capability for making use of expert knowledge encoded in Prolog.

In the embodiment of the invention described with reference to Figures 1 to 3, expression
of characterisations of anomalous (e.g. fraudulent) behaviour in data was in First-Order
Logic (e.g. Prolog programs). This is not essential. The characterisations may also be
expressed in Higher-Order Logic using a programming language such as Escher: J.W.
Lloyd (1999) "Programming in an Integrated Functional and Logic Language", Journal of
Functional and Logic Programming 1999(3). As increasingly complex problems are
tackled, a more intricate approach is desirable. Escher is a functional logic language
whose higher-order constructs allow arbitrarily complex observations to be captured and
highly expressive generalisations to be conveyed. The Higher-Order Logic arises from
logic functions and predicates being allowed to take other functions and predicates as
arguments: it provides a natural mechanism for reasoning about sets of objects.

Rules characterising anomalous behaviour may be automatically developed using a
learning system that learns rules expressed in Higher-Order Logic such as the Strongly
Typed Evolutionary Programming System (STEPS): see C.J. Kennedy, Ph.D. Thesis
(2000), Department of Computer Science, University of Bristol, England.

STEPS alleviates the challenging problem of identifying an underlying structure for
searching the resulting hypothesis space efficiently. This is achieved through an
evolutionary based search that allows the vast space of highly expressive Escher programs
to be explored. STEPS provides a natural upgrade of the evolution of concept descriptions
to the higher-order level.

In particular STEPS uses what is referred to as an 'individuals-as-terms' approach to
knowledge representation: this approach localises all information provided by an example
as a single item or expression incorporating a set of elements characteristic of that
example. For example, in the preceding embodiment of the invention, the problem domain
is concerned with characterising fraudulent cashiers. Using the individuals-as-terms
representation, all information relating to an individual cashier is combined into a single
item. Such information is the cashier's identifying number or id and the respective
transactions that the cashier has generated. Therefore each example consists of the
cashier's id and a list of its transactions expressed as a single tuple (generic name for pair,
triple etc.), e.g.:

(cashier1, [(1,(30,8,2003),(09:02),121.87,123.96), ... (5,(30,8,2003),(08:44),115.71,116.5)])

This differs from the approach described in the preceding example where transactions
were presented as separate Prolog facts.
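
The change of representation can be sketched in Python as follows: flat transaction records, one per fact, are grouped into one tuple per cashier. The field order follows the trans facts given earlier; the function name and the use of plain tuples and strings for dates and times are assumptions made for illustration.

```python
from collections import defaultdict


def individuals_as_terms(facts):
    """Group flat transaction facts into (cashier, [transactions]) tuples."""
    grouped = defaultdict(list)
    for trans_id, date, time, cashier, expected, actual, _flag in facts:
        # The cashier id moves out of the transaction and becomes the key.
        grouped[cashier].append((trans_id, date, time, expected, actual))
    return list(grouped.items())
```

Each resulting item holds everything known about one cashier, so a learner can treat the cashier as a single term instead of joining separate facts.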

The individuals-as-terms representation allows examples of arbitrary complexity to be
treated in a uniform manner. STEPS also supports λ-abstractions as arguments to higher-
order functions, thus enabling the creation of new functions not contained in an original
alphabet. Finally, STEPS provides a number of specialised genetic operators for rule
generation.

Rules learnt by STEPS are of the form: 

IF {set of conditions} THEN {behaviour is anomalous} ELSE {behaviour is not anomalous}

This form is referred to as the rule template.

Referring to Figure 4, a first step in a computer-implemented process 60 for generating or
searching for rules is to use training examples 62 to create an alphabet 64 from which to
construct the rules. This alphabet includes selector functions that allow properties of the
training examples to be extracted so that comparisons and inferences can be made.
Training examples are formed by using constructs known as datatypes such as lists, sets
and tuples. Items contained in the lists, tuples and sets are referred to as components of
the datatypes. The selector functions are automatically generated based on the datatypes
of the training examples using an algorithm referred to as "AdaptedEnumerate" (see
Kennedy reference above). Once the components of the datatypes have been selected, 
conditions can be built on them or they can be compared to values or other data types in 
the rules. In addition to the selector functions, the alphabet 64 consists of any additional 
concepts and facts of interest (background knowledge) expressed as Escher functions and 
constant values that may be extracted from training examples or specified by a user in 
advance. The background knowledge typically includes Boolean functions known as 
conjunction and disjunction (logical AND and OR). These functions can be used to 
combine a number of conditions and/or comparisons in a rule. 

Once the alphabet has been compiled at 64 and input to the computer 1 in Figure 1, the
computer carries out an evolutionary search to produce a set of rules as follows. It forms a
new or initial population of rules at 66 by combining components of the alphabet to form
conditions of a number of rule templates, and an iteration count index G is set to 1. To
implement this, the components of the alphabet are randomly combined, but in such a way
that only valid Escher functions are formed. This is achieved by maintaining type
consistency and variable consistency, defined by:

Type consistency: a function argument must be of a type for which the function was
defined; e.g. if the function f(x,y) = x + y takes integers as its arguments x and y, then
letting x become the value 4 and letting y become the value Red makes the
function "4 + Red": this violates the type consistency constraint and cannot
be incorporated in a rule.

Variable consistency: all local variables must be within the scope of a quantifier. The
quantification of a variable in this context is logical terminology for specifying a range
of values that the variable may take. In the following example the local variable x
has been quantified (using a syntax \x -> meaning 'there exists a variable x such
that') by stating that it is an element of the list t (t is a global variable representing a
list and does not need to be quantified itself); but the local variable y has not been
quantified, therefore the variable consistency constraint has been violated:

\x -> (elem x t) && x + y >= 2 
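
The two constraints can be illustrated with a minimal Python sketch. It is not Escher and not the search implementation: the tuple-based expression tree, the `SIGNATURES` table and all function names are assumptions introduced only for this illustration.

```python
# Hypothetical expression tree (an assumption, not Escher syntax):
#   ("var", name), ("const", value), ("app", function_name, [args]),
#   ("lambda", bound_name, body)

SIGNATURES = {"+": (int, int)}  # assumed table of parameter types per function


def is_type_consistent(function_name, arguments):
    """True when each argument has the type the function was defined for."""
    parameter_types = SIGNATURES[function_name]
    return len(arguments) == len(parameter_types) and all(
        isinstance(arg, expected)
        for arg, expected in zip(arguments, parameter_types)
    )


def free_variables(expr, bound=frozenset()):
    """Return the names of variables used outside the scope of a binder."""
    kind = expr[0]
    if kind == "const":
        return set()
    if kind == "var":
        return set() if expr[1] in bound else {expr[1]}
    if kind == "app":
        result = set()
        for arg in expr[2]:
            result |= free_variables(arg, bound)
        return result
    if kind == "lambda":
        return free_variables(expr[2], bound | {expr[1]})
    raise ValueError(f"unknown node kind: {kind}")


def is_variable_consistent(expr, global_names):
    """True when every unbound variable is a declared global variable."""
    return free_variables(expr) <= set(global_names)


# \x -> (elem x t) && x + y >= 2, with t global and y left unquantified
example = ("lambda", "x",
           ("app", "&&", [
               ("app", "elem", [("var", "x"), ("var", "t")]),
               ("app", ">=", [("app", "+", [("var", "x"), ("var", "y")]),
                              ("const", 2)]),
           ]))
```

Here `is_type_consistent("+", [4, "Red"])` is false, mirroring the "4 + Red" case, and `is_variable_consistent(example, {"t"})` is false because y is neither quantified nor global.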

Once the set of conditions of the rule templates has been completed, the rules so formed 
are evaluated by the computer at 68 by applying them to the training examples 62 in order 
to estimate the accuracy with which they classify the training examples as anomalous or not. 
This establishes their fitness, i.e. it identifies which of the rules are best at classifying the 
training examples. At 70, a check is made to determine whether or not one of two 
termination criteria is met, i.e. if either 

1. a prearranged number of iterative search steps has been carried out, or 

2. a rule that adequately classifies all of the training examples has been found to 
within a prearranged accuracy. The accuracy will not necessarily be 100% because 
that may result in noise contained in example data having too much effect. 

If neither of the termination criteria is met, the computer begins a procedure to generate 
improved rules by using genetic operators to create a new population of rules from the 
previous population created at 66. A population count index is reset to zero at 72, and at 
74 a check is made to determine whether or not the new population is complete. If the new 
population is not complete, a rule is selected at 76 from the previous population using 
tournament selection. To perform a 
tournament selection, a subset of the previous population rules is randomly selected, and 
the rule in the subset with the highest fitness (classification accuracy) is the winner and is 
selected. Each of the previous population rules has the same probability of being selected 
for the subset. 
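
Tournament selection as described above might be sketched in Python as follows. The function name, its parameters and the use of sampling without replacement are assumptions made for this illustration.

```python
import random


def tournament_select(population, fitness, subset_size, rng=random):
    """Draw subset_size rules uniformly at random and return the fittest one.

    population:  list of rules from the previous population
    fitness:     callable mapping a rule to its classification accuracy
    subset_size: number of rules entered into the tournament (assumed parameter)
    """
    if not 1 <= subset_size <= len(population):
        raise ValueError("subset_size must be between 1 and len(population)")
    contestants = rng.sample(population, subset_size)  # equal chance for every rule
    return max(contestants, key=fitness)
```

A small subset gives weaker rules a reasonable chance of being selected; a subset equal to the whole population always returns the fittest rule.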

A genetic operator (see Kennedy reference above) is now selected by the computer at 78. 
It has one of the following functions: a) combining two rules to form a new rule, b) 
modifying a single rule by deleting one of its conditions or adding a new condition to it, or c) 
changing one of a rule's constant values for another of an appropriate type. Genetic 
operators are applied in such a way as to maintain type and variable consistency for rules. 
A check is made at 80 to determine whether or not the genetic operator selected at 78 has 
the function of combining two rules to form a new rule. If so, another rule is selected at 82 
by implementing a further tournament using a new randomly selected subset of the 
previous population, and the two-rule genetic operator is applied at 84 to the pair of rules 
selected at 76 and 82. If not, the single-rule genetic operator is applied at 84 to the rule 
selected at 76. In either case a population count is then incremented by 1 at 86. 

The process of stages 74 to 86 iterates until the population count indicates that the new 
population of rules is of prearranged number, e.g. perhaps but not necessarily equal to the 
population size created at 66. When the prearranged number is reached, the iteration 
count G is incremented by 1 at 88 and then the evaluate population stage 68 is triggered to 
evaluate the new population. The two termination criteria are checked once more at 70. 
The procedure continues if neither criterion is met, i.e. if the iteration count G has not 
reached a prearranged maximum, and a rule has not been found that adequately classifies 
the training examples. 
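
The control flow of stages 68 to 88 can be summarised in a short Python sketch. It is an illustration under stated assumptions, not the implementation: `evaluate`, `select` and the `operators` list are hypothetical callables supplied by the caller, and the new population is assumed to have the same size as the previous one.

```python
import random


def evolve(initial_population, evaluate, select, operators,
           max_iterations, target_accuracy, rng=random):
    """Run the evaluate / terminate / breed loop and return the best rule.

    evaluate(rule)  -> classification accuracy on the training examples
    select(scored)  -> one rule chosen from a list of (accuracy, rule) pairs
    operators       -> list of (arity, function) pairs, with arity 1 or 2
    """
    population = list(initial_population)
    iteration = 1                                        # iteration count G
    while True:
        scored = [(evaluate(rule), rule) for rule in population]    # stage 68
        best_score, best_rule = max(scored, key=lambda pair: pair[0])
        # Stage 70: stop when either termination criterion is met.
        if iteration >= max_iterations or best_score >= target_accuracy:
            return best_rule, best_score, iteration
        new_population = []                              # stage 72
        while len(new_population) < len(population):     # stage 74
            parent = select(scored)                      # stage 76
            arity, operator = rng.choice(operators)      # stage 78
            if arity == 2:                               # stage 80
                other = select(scored)                   # stage 82
                child = operator(parent, other)          # stage 84, two rules
            else:
                child = operator(parent)                 # stage 84, one rule
            new_population.append(child)                 # stage 86
        population = new_population
        iteration += 1                                   # stage 88
```

In practice `select` would be a tournament selection and each operator would be responsible for preserving type and variable consistency.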

If one or both of the termination criteria are met, the computer terminates the rule search 
60. The computer determines the best performing rule (i.e. giving the best classification 
accuracy with the training examples) by testing at 90, and its classification accuracy with 
one or more conditions removed is determined. To remove redundancy, it prunes (deletes 
from the rule) conditions that when removed do not alter the accuracy of the rule, and the 
pruned rule is designated as a result at 92. Although this is a single rule, the ability to use 
the Boolean logical OR function in the rules makes it possible for such a rule to be 
equivalent to a number of rules obtained in the preceding example. 
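
The pruning step might look like the following Python sketch. Representing a rule as a list of conditions, removing conditions one at a time in order, and the `accuracy` callable are all assumptions made for illustration; the description above does not fix the order in which conditions are tried.

```python
def prune_rule(conditions, accuracy):
    """Drop every condition whose removal leaves the rule's accuracy unchanged.

    conditions: list of condition objects making up the rule
    accuracy:   callable mapping a list of conditions to classification accuracy
    """
    baseline = accuracy(conditions)
    kept = list(conditions)
    for condition in list(conditions):
        trial = [c for c in kept if c is not condition]
        if accuracy(trial) == baseline:   # the condition was redundant
            kept = trial
    return kept
```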

Using data from the embodiment described with reference to Figures 1 to 3, to characterise 
a fraudulent cashier, an individuals-as-terms representation used by STEPS groups the 
transactions associated with each cashier into a list: 

fraudulent((cashier1, [(1,(30,8,2003),(09:02),121.87,123.96), ... , 
    (5,(30,8,2003),(08:44),115.71,116.5)])) = False; 

fraudulent((cashier2, [(6,(30,8,2003),(22:40),431.68,435.17), 
    (8,(30,8,2003),(22:35),420.01,423.7)])) = True; 

Therefore the selector functions generated for this problem include the ability to select 
transactions from the lists, to obtain sub-lists with transactions that have certain properties, 
and to obtain the length of such sub-lists. The transactions themselves are tuples with five 
positions, each corresponding to a respective datatype of which there are five; the selector 
functions allow each of the five datatypes to be extracted and processed using various 
conditions. Such a condition may consist of comparing datatype values to one another or may 
be application of a background knowledge concept. The background knowledge may 
contain the discrepancy concept represented as an Escher function; this discrepancy 
function takes a transaction and returns the discrepancy between the expected amount of 
cash in the till and the actual amount. Background knowledge can be embodied as a 
calculation in this way. The background knowledge in this case is that a discrepancy might 
be a useful thing to look at when constructing the rules: if so, it is necessary to find the size 
of the discrepancy that is sufficient to indicate fraud. The additional concepts provided in 
the ILP case may be constructed from the discrepancy function with the selector functions 
that have been automatically generated during the rule construction process. For example, 
identification of the number of transactions made by a cashier in a given month and year 
may be achieved using the following Escher fragment (the variable x is global to the 
function in which the fragment would be contained and is therefore not further quantified): 
length (filter (\y -> (y 'elem' (proj2 x) && ((proj2 (proj2 y) == Month) 
&& (proj3 (proj2 y) == Year))))); 

In the above expression, a filter function creates a list of 'y's that meet a number of 
criteria. First the ys are quantified: '\y -> y 'elem' (proj2 x)' specifies that the items in 
the list (represented by the variable y) are the transactions associated with a cashier. 
The 'proj2 x' function projects onto the second datatype that makes up an example (the 
example is represented by the global variable x). The cashier's id is a first datatype 
and the second datatype is a list of transactions associated with the cashier. The 
filter function is used to select transactions that meet two criteria. The first criterion 
is that the transactions fall within a given month: '(proj2 (proj2 y) == Month)'. The 
variable y has been quantified to be a transaction. A transaction is itself a tuple with 
five positions; the second position of the transaction tuple (obtained by applying the 
'proj2' function) specifies the date as a triple (three position tuple), the second 
position of which contains the month (obtained by applying a further 'proj2' function). 
The month is then compared to a given month 'Month' (using the '==' 
function). The second criterion is that the transactions that make up the list fall within 
a given year: '(proj3 (proj2 y) == Year)'. The date triple is obtained in the same 
manner as described above, but this time it is the third element of the date, the year, 
that is of interest (obtained using the 'proj3' function). The year contained in the date 
triple is then compared to a given year 'Year'. The length of this filtered list of 
transactions is then obtained using the 'length' function to provide the number of 
transactions that meet the specified criteria. 
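
For readers unfamiliar with Escher, a rough Python analogue of the fragment is sketched below. The function name is hypothetical, tuple indexing stands in for the proj selector functions, and storing the time as a string is an assumption; the sample values are those of the cashier1 example above.

```python
def transactions_in_month(example, month, year):
    """Count a cashier's transactions dated in the given month and year."""
    _cashier_id, transactions = example          # proj1 x and proj2 x
    return len([
        y for y in transactions
        if y[1][1] == month                      # proj2 (proj2 y) == Month
        and y[1][2] == year                      # proj3 (proj2 y) == Year
    ])


# (id, (day, month, year), time, amount, amount): values from the example above
cashier1 = ("cashier1", [
    (1, (30, 8, 2003), "09:02", 121.87, 123.96),
    (5, (30, 8, 2003), "08:44", 115.71, 116.5),
])
```

With this data, `transactions_in_month(cashier1, 8, 2003)` counts both listed transactions.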

In order to identify the total number of transactions with a discrepancy made by a cashier in 
a given month, the following is used: 

length (filter (\y -> (y 'elem' (proj2 x) &&( ((proj2 (proj2 y) == Month) 
&& (proj3 (proj2 y) == 2003)) && (( discrepancy y) != 0 ))))); 

Here again in the above expression a filter function is used to obtain a list of 
transactions (represented by the variable y) that meet two criteria. The first criterion 
is the same as above: the transactions all fall in a given month (obtained using 
'(proj2 (proj2 y) == Month)'). In this case the second criterion uses the discrepancy 
function (specified as background knowledge) to obtain the discrepancy between 
the expected amount of cash in the till and the actual amount during transaction y 
(using 'discrepancy y'). The value obtained by this function is then tested to check 
that it is not equal to zero (using '((discrepancy y) != 0 )'). The length function is 
then used to obtain the length of the list containing (and hence number of) 
transactions that occur within a given month that have a nonzero discrepancy 
between the expected amount of cash in the till and the actual amount. 

The rule set presented in the earlier example can then equivalently be expressed in Escher 
as follows: 

fraudulent(cashier) = if 

(length (filter (\y -> (y 'elem' (proj2 x) &&( 
((proj2 (proj2 y) == Month) && 
(( discrepancy y) > 100 )))))) 
>= 10 
then True 
else False; 

This rule expresses that, if the number of transactions associated with a 
cashier, carried out in a given month with a discrepancy greater than or 
equal to 100, is greater than or equal to 10, then the cashier is fraudulent. 

fraudulent(cashier) = if 

(length (filter (\y -> (y 'elem' (proj2 x) &&( 
((proj2 (proj2 y) == Month) && 
(( discrepancy y) != 0 )))))) 
>= 455 
then True 
else False; 

This rule expresses that, if the number of transactions with non-zero 
discrepancy associated with an individual cashier and carried out in a given 
month is greater than or equal to 455, then the cashier is fraudulent. 
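
The two rules can be combined with a logical OR, as in the Python sketch below. It is an illustration only: the layout of the transaction tuple (expected and actual till amounts in the fourth and fifth positions) and the use of an absolute difference are assumptions, and the first threshold follows the strict comparison `> 100` that appears in the Escher code.

```python
def discrepancy(transaction):
    """Difference between expected and actual cash in the till.

    Assumption: positions 4 and 5 of the tuple hold those two amounts,
    and the size of the difference is what matters, not its sign.
    """
    _, _, _, expected, actual = transaction
    return abs(actual - expected)


def count_matching(example, month, predicate):
    """Number of the cashier's transactions in the month that satisfy predicate."""
    _cashier_id, transactions = example
    return sum(1 for y in transactions if y[1][1] == month and predicate(y))


def fraudulent(example, month):
    """Disjunction of the two rules given above."""
    large = count_matching(example, month, lambda y: discrepancy(y) > 100)
    nonzero = count_matching(example, month, lambda y: discrepancy(y) != 0)
    return large >= 10 or nonzero >= 455
```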

Another embodiment of the invention concerns characterisation of software vulnerabilities 
in a disassembled equivalent of binary code by code anomaly detection. It may be used in 
conjunction with current commercially available systems that can disassemble binary code. 
In this embodiment, disassembly of a program in binary code is a process which 
retrieves an assembly language equivalent of the program. Disassembly is to facilitate 
human understanding during development of a rule set; however, it is not essential, and 
once rules have been learnt in assembly language they may be translated to operate 
directly on binary program code. 

Various kinds of fragments of code may indicate a vulnerability in a software application 
which is potentially exploitable by an unauthorised intruder. The most common form of 
vulnerability is a buffer overflow. Strings in the C programming language are sequences of 
bytes, with a zero indicating the end of a string. This allows strings to be of unlimited 
length. However, memory is in limited supply, so fixed-size buffers must be allocated to 
hold strings. Copying a string of unknown (potentially unlimited) length into a buffer of fixed 
size can cause errors. If a C function known as strcpy is used, then the string will be 
copied even if it overflows its allocated space. In a C program it is typical to allocate fixed 
buffers for strings on a computer stack. This means that overflowing the buffer will 
overwrite a part of the stack not allocated to the buffer. 

C is a procedural language that involves many function calls. Function calls are usually 
implemented on a computer (at a low level) by putting on to the stack a code address to 
return to after the call. Nested and recursive function calls may be implemented in this 
way. However, this approach enables a buffer overflow to overwrite the return address on 
the stack, and consequently data intended for the buffer replaces the return address. 
Overflow data supplied by an attacker may therefore specify a new return address, thus 
altering operation of a program containing the overwritten return address. 

A common technique for an attacker to alter a return address is to supply program code to 
be executed in a buffer, and make the new return address point into that code; this makes 
the program execute arbitrary code inserted by the attacker. Another tactic is arc injection, 
which is a method that involves returning into an address in a known common library (such 
as the C standard library), to execute a C function such as system(), which will execute a 
command on the host machine. 

In this embodiment of the invention a number of different disassembled software programs 
are employed. Each program is broken down into individual instructions that form the 
program, where each instruction is described by a number of attributes including a program 
identifier (to indicate which program the instruction belongs to), the address of the 
instruction, the instruction operator and a list of the instruction's operands. Each program is 
labelled with a single Boolean attribute which indicates "true" if the program is known or 
suspected to contain a vulnerability and "false" otherwise. Background knowledge which is 
used includes such functions as identifying a copying loop within a program. A copying 
loop is defined as a portion of code that (in any order) copies to a register (the Temporary 
Register) from a source pointer, changes the source pointer, copies from the register into a 
destination pointer, changes that destination pointer, and has a control flow path from the 
end of the code back to the start (thus forming a loop). Other definitions of a similar nature 
are also applicable. 

Benefits of applying the invention to characterisation of software vulnerabilities comprise: 

* enabling prevention of intrusions, by detecting features that an intruder might 
use, before the intruder has a chance to do so - "prevention is better than cure"; 

* no reliance on access to source code; 

* potential for faster operation than source code static analysis; 

* potential to be more effective than existing black-box testing, because this 
embodiment studies the code for known semantic patterns, rather than probing 
it for potential bugs; 

* characterisations in the form of rule sets may be learnt automatically (rather 
than manually as in the prior art) from training data and any available 
background knowledge or rules contributed by experts; this reduces costs and 
duration of the characterisation process; 

* rule sets which are generated by this process are human readable and are 
readily assessable by human experts prior to deployment within a fraud 
management system. 



This embodiment of the invention employs inductive logic programming software 
implemented in the Prolog logic programming language previously described. The target 
concept description in this embodiment is a characterisation of software vulnerabilities to 
enable prediction of whether a compiled program is vulnerable or not. The set of rules 
should be applicable to a new, previously unseen and unlabelled disassembled program 
and be capable of indicating accurately whether it is vulnerable or not: 

IF {set of conditions} THEN {program is vulnerable} (3) 

In addition to receiving labelled program data, the inductive logic programming software 
may receive input of further information, i.e. concepts, facts of interest or functions that can 
be used to calculate values of interest, e.g. facts regarding the existence of copying loops 
within the compiled programs. As previously mentioned, this further information is known 
as background knowledge, and is normally obtained from an expert in the detection of 
software vulnerabilities. 

Examples of a small training data set, background knowledge and a rule set generated 
therefrom will now be given. 

Training data: 

The training data is a set of disassembled software programs that are referenced by a 
program identifier. The target concept is vulnerable( Program Identifier ). 

vulnerable(program1). 
vulnerable(program2). 

:-vulnerable(programx). 
:-vulnerable(programy). 

The first two of the four statements immediately above specify that program1 and 
program2 contain vulnerabilities. The third and fourth of these statements are preceded by 
the symbols ':-', specifying that programx and programy do not contain vulnerabilities. 
These form positive and negative examples for learning the concept of a vulnerable 
program. 

Instructions that form the programs can be stored in a number of formats. An initial simple 
format is in Prolog facts with one fact per instruction: 

simple_instruction( Program Identifier , Instruction Address , Instruction Operator , 
Instruction Operand List ). 

A sample of an example set of simple instruction data associated with the program with 
program identifier "program1" is shown below. 

simple_instruction(program1_exe,x401000,mov,[x8,[esp,x1],eax]). 
simple_instruction(program1_exe,x401004,mov,[eax,edx]). 
simple_instruction(program1_exe,x401006,mov,[[eax],cl]). 
simple_instruction(program1_exe,x401008,inc,[eax]). 
simple_instruction(program1_exe,x401009,test,[cl,cl]). 
simple_instruction(program1_exe,x40100b,jne,[x401006]). 
simple_instruction(program1_exe,x40100d,push,[esi]). 
simple_instruction(program1_exe,x40100e,push,[edi]). 
simple_instruction(program1_exe,x40100f,mov,[xc,[esp,x1],edi]). 
simple_instruction(program1_exe,x401013,sub,[edx,eax]). 
simple_instruction(program1_exe,x401015,dec,[edi]). 
simple_instruction(program1_exe,x401016,mov,[[x1,[edi]],cl]). 
simple_instruction(program1_exe,x401019,inc,[edi]). 
simple_instruction(program1_exe,x40101a,test,[cl,cl]). 
simple_instruction(program1_exe,x40101c,jne,[x401016]). 
simple_instruction(program1_exe,x40101e,mov,[eax,ecx]). 
simple_instruction(program1_exe,x401020,shr,[x2,ecx]). 
simple_instruction(program1_exe,x401023,mov,[edx,esi]). 
simple_instruction(program1_exe,x401025,repz,[movsl,ds,[esi],es,[edi]]). 
simple_instruction(program1_exe,x401027,mov,[eax,ecx]). 
simple_instruction(program1_exe,x401029,and,[x3,ecx]). 
simple_instruction(program1_exe,x40102c,push,[x408040]). 
simple_instruction(program1_exe,x401031,repz,[movsb,ds,[esi],es,[edi]]). 
simple_instruction(program1_exe,x401033,call,[x401120]). 
simple_instruction(program1_exe,x403173,mov,[[esi],al]). 
simple_instruction(program1_exe,x403175,add,[x1,esi]). 
simple_instruction(program1_exe,x403178,mov,[al,[edi]]). 
simple_instruction(program1_exe,x40317a,add,[x1,edi]). 
simple_instruction(program1_exe,x40317d,test,[al,al]). 
simple_instruction(program1_exe,x40317f,je,[x4031b8]). 
simple_instruction(program1_exe,x403181,sub,[x1,ebx]). 
simple_instruction(program1_exe,x403184,jne,[x403173]). 
... 

The simple instructions can also be transformed into a graph-like format of nodes, with 
each node represented as a Prolog fact containing either a sequence of non-branching 
code, or a branching instruction, e.g.: 

node_instruction( Program Identifier , Block Start Index , Block Length , List of 
Triples: ( Instruction Address , Instruction Operator , Instruction Operand List ) ). 

Block Length in the above instruction can be zero, which indicates that the list contains a 
single branching instruction. A branch is a program control structure in which one or more 
alternative sets of instructions are selected for execution. The selection is carried out when 
the program is run by means of a branching instruction. 
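
The grouping of simple instructions into nodes might be sketched in Python as follows. This is an illustration under assumptions: the set of branching operators is a small illustrative subset, instructions are numbered from 1, and the function and variable names are hypothetical. The sample data are the first six instructions of program1_exe listed above.

```python
BRANCH_OPERATORS = {"jne", "je", "jmp", "call"}   # assumed, illustrative subset


def build_nodes(instructions):
    """Group (address, operator, operands) triples into nodes.

    Returns a list of (block_start_index, block_length, triples); a branching
    instruction forms its own node with block_length 0.
    """
    nodes = []
    block = []
    block_start = 1
    for index, triple in enumerate(instructions, start=1):
        if triple[1] in BRANCH_OPERATORS:
            if block:                              # close the preceding block
                nodes.append((block_start, len(block), block))
            nodes.append((index, 0, [triple]))     # single branching instruction
            block = []
            block_start = index + 1
        else:
            block.append(triple)
    if block:                                      # trailing non-branching code
        nodes.append((block_start, len(block), block))
    return nodes


sample = [
    ("x401000", "mov", ["x8", ["esp", "x1"], "eax"]),
    ("x401004", "mov", ["eax", "edx"]),
    ("x401006", "mov", [["eax"], "cl"]),
    ("x401008", "inc", ["eax"]),
    ("x401009", "test", ["cl", "cl"]),
    ("x40100b", "jne", ["x401006"]),
]
```

Applied to `sample`, the sketch yields a five-instruction block starting at index 1 followed by a branching node at index 6.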



A sample of an example set of graphically represented instruction data is shown below. 

node_instruction(program1_exe,1,5,[(x401000,mov,[x8,[esp,x1],eax]), 
    (x401004,mov,[eax,edx]),(x401006,mov,[[eax],cl]),(x401008,inc,[eax]), 
    (x401009,test,[cl,cl])]). 

node_instruction(program1_exe,6,0,[(x40100b,jne,[x401006])]). 

node_instruction(program1_exe,7,8,[(x40100d,push,[esi]),(x40100e,push,[edi]), 
    (x40100f,mov,[xc,[esp,x1],edi]),(x401013,sub,[edx,eax]),(x401015,dec,[edi]), 
    (x401016,mov,[[x1,[edi]],cl]),(x401019,inc,[edi]),(x40101a,test,[cl,cl])]). 

node_instruction(program1_exe,15,0,[(x40101c,jne,[x401016])]). 

node_instruction(program1_exe,16,8,[(x40101e,mov,[eax,ecx]),(x401020,shr,[x2,ecx]), 
    (x401023,mov,[edx,esi]),(x401025,repz,[movsl,ds,[esi],es,[edi]]), 
    (x401027,mov,[eax,ecx]),(x401029,and,[x3,ecx]),(x40102c,push,[x408040]), 
    (x401031,repz,[movsb,ds,[esi],es,[edi]])]). 

node_instruction(program1_exe,24,0,[(x401033,call,[x401120])]). 

node_instruction(program1_exe,3061,5,[(x403173,mov,[[esi],al]),(x403175,add,[x1,esi]), 
    (x403178,mov,[al,[edi]]),(x40317a,add,[x1,edi]),(x40317d,test,[al,al])]). 

node_instruction(program1_exe,3066,0,[(x40317f,je,[x4031b8])]). 

node_instruction(program1_exe,3067,1,[(x403181,sub,[x1,ebx])]). 

node_instruction(program1_exe,3068,0,[(x403184,jne,[x403173])]). 

This graph format can then be normalised by splitting nodes to ensure that each branch 
always points to the start of a node (instruction sequence), never the middle. 

Background knowledge: this includes tests that are thought to be appropriate by a domain 
expert. Examples of appropriate background concepts, represented using Prolog, are: 

copying_loop( Program Identifier , Loop Start Index , List of Triples: ( Instruction Address , 
Instruction Operator , Instruction Operand List ) , Temporary Register ). 

The definition of a copying loop has been given previously, and an example is as follows: 

copying_loop(program1_exe,3061,[ (x403173, mov, [[esi], al]), (x403175, add, [x1, 
esi]), (x403178, mov, [al, [edi]]), (x40317a, add, [x1, edi]), (x40317d, test, [al, al]), 
(x40317f, je, [x4031b8]), (x403181, sub, [x1, ebx]), (x403184, jne, [x403173])],al). 

length_loop( Program Identifier , Loop Start Index , List of Triples: ( Instruction Address , 
Instruction Operator , Instruction Operand List ) ). 

A length (finding) loop is defined as a portion of code that (in any order) copies to a register 
from a source pointer, changes the source pointer, and checks the register for a value of 
zero, and has a control flow path from the end of the code back to the start (thus forming a 
loop). 

length_loop(program1_exe,1,[ (x401006, mov, [[eax], cl]), (x401008, inc, [eax]), 
(x401009, test, [cl, cl]), (x40100b, jne, [x401006])],cl). 

length_loop(program1_exe,7,[ (x401016, mov, [[x1, [edi]], cl]), (x401019, inc, [edi]), 
(x40101a, test, [cl, cl]), (x40101c, jne, [x401016])],cl). 

follows( Program Identifier , Block A Index , Block B Index , List of Triples: ( Instruction 
Address , Instruction Operator , Instruction Operand List ) ). 

This is an item of background knowledge which describes the situation 'Block B follows 
Block A'. Usually this is bounded by an upper limit of the number of instructions between 
Block A and Block B to prevent a large amount of background knowledge being generated 
by the combinatorial nature of the predicate. The list of instructions between the two blocks 
is also stored in the background knowledge. E.g. 

follows(program1_exe,6,15,[(x40100d,push,[esi]),(x40100e,push,[edi]), 
    (x40100f,mov,[xc,[esp,x1],edi]),(x401013,sub,[edx,eax]),(x401015,dec,[edi]), 
    (x401016,mov,[[x1,[edi]],cl]),(x401019,inc,[edi]),(x40101a,test,[cl,cl])]). 

strlen_call( Program Identifier , StrlenIndex ). 

This item of background knowledge indicates that the program makes a call to the function 
strlen at the specified index, e.g.: 

strlen_call(program1_exe,1000). 

get_jump_tests_from_list( InstructionList , JumpTests ). 

This item of background knowledge extracts tests that precede conditional jumps in an 
instruction list. 

empty_list( List). 

This item of background knowledge tests whether or not a given list is empty. 

single_item_list( List , RegisterTested ). 

This item of background knowledge tests whether or not a given list has a single jump test 
(conditional), and if it has, returns the tested register in RegisterTested. 

unreferenced_registers( InstructionList , Register ). 

This item of background knowledge tests whether or not the given list modifies the given 
register. 

Generated rule set: The target concept is vulnerable( Program Identifier ). Rules in the 
following rule set characterise programs that are vulnerable to buffer overflows. 

vulnerable(Program) :- 
    copying_loop(Program, CopyingIndex, CopyingLoop, CopyingRegister), 
    strlen_call(Program, StrlenIndex), 
    follows(Program, StrlenIndex, CopyingIndex, InstrBetween), 
    get_jump_tests_from_list(InstrBetween, JumpTests), 
    empty_list(JumpTests). 

vulnerable(Program) :- 
    copying_loop(Program, CopyingIndex, CopyingLoop, CopyingRegister), 
    get_jump_tests_from_list(CopyingLoop, JumpTests), 
    has_test_for_zero(JumpTests, TestForZero, OtherTests), 
    single_item_list(OtherTests, RegisterTested), 
    unreferenced_registers(CopyingLoop, RegisterTested). 

The first rule of the above rule set classifies a program as vulnerable if there is a copying 
loop preceded by a call to the C function strlen, with no conditional jumps between the two. 
The second rule classifies a program as vulnerable if there is a copying loop, with a test for 
zero, and one other test, but a register referenced by the other test is not used during the 
loop. 
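
The first rule can be paraphrased in Python as a rough sketch. It does not reproduce the Prolog background predicates: the in-memory representation of the facts, the set of conditional jump operators, and the decision to count a test only when the jump that follows it lies in the same instruction list are all assumptions made for illustration.

```python
CONDITIONAL_JUMPS = {"je", "jne"}   # assumed, illustrative subset


def get_jump_tests(instructions):
    """Return the instructions that immediately precede a conditional jump."""
    return [previous
            for previous, current in zip(instructions, instructions[1:])
            if current[1] in CONDITIONAL_JUMPS]


def vulnerable_rule1(copying_loop_indices, strlen_indices, follows):
    """A strlen call followed by a copying loop with no jump tests in between.

    follows maps (block_a_index, block_b_index) to the list of
    (address, operator, operands) triples lying between the two blocks.
    """
    for strlen_index in strlen_indices:
        for loop_index in copying_loop_indices:
            between = follows.get((strlen_index, loop_index))
            if between is not None and not get_jump_tests(between):
                return True
    return False
```

A missing `follows` entry means that the copying loop is not known to follow the strlen call, so the rule does not fire for that pair.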

The software vulnerability embodiment of the invention described above provides similar 
benefits to those associated with the fraud embodiment described with reference to Figures 
1 to 3. 

The processes described in the foregoing description can clearly be evaluated by an 
appropriate computer program comprising program instructions embodied in an appropriate 
carrier medium and running on a conventional computer system. The computer program 
may be embodied in a memory, a floppy or compact or optical disc or other hardware 
recording medium, or an electrical or optical signal. Such a program is straightforward for a 
skilled programmer to implement on the basis of the foregoing description without requiring 
invention, because it involves well known computational procedures. 
Claims 

1. A method of anomaly detection characterised in that it incorporates the steps of: 

a) developing a rule set of at least one anomaly characterisation rule from a 
training data set and any available relevant background knowledge using at 
least first order logic, a rule covering a proportion of positive anomaly 
examples of data in the training data set, and 

b) applying the rule set to test data for anomaly detection therein. 

2. An automated method of anomaly detection characterised in that it comprises using 
computer apparatus to execute the steps of: 

a) developing a rule set of at least one anomaly characterisation rule from a 
training data set and any available relevant background knowledge using at 
least first order logic, a rule covering a proportion of positive anomaly 
examples of data in the training data set, and 

b) applying the rule set to test data for anomaly detection therein. 

3. A method according to Claim 2 characterised in that it includes developing the rule set 
using Higher-Order logic. 

4. A method according to Claim 3 characterised in that it includes developing the rule set 
by: 

a) forming an alphabet having selector functions allowing properties of the 
training data set to be extracted, together with at least one of the following: 
additional concepts, background knowledge constant values and logical AND 
and OR functions, 

b) forming current rules from combinations of items in the alphabet such that type 
consistency and variable consistency is preserved, 

c) evaluating the current rules for adequacy of classification of the training data 
set, 

d) if no current rule adequately classifies the training data set, generating new 
rules by applying at least one genetic operator to the current rules, a genetic 
operator having one of the following functions: i) combining two rules to form a 
new rule, ii) modifying a single rule by deleting one of its conditions or adding a 
new condition to it or iii) changing one of a rule's constant values for another 
of an appropriate type, and 

e) designating the new rules as the current rules and iterating steps c) onwards 
until a current rule adequately classifies the training data set or a 
predetermined number of iterations is reached. 

5. A method according to Claim 2 characterised in that data samples in the training data 
set have characters indicating whether or not they are associated with anomalies. 

6. A method according to Claim 5 characterised in that it is a method of detecting 
telecommunications or retail fraud from anomalous data. 

7. A method according to Claim 6 characterised in that it employs inductive logic 
programming to develop the rule set. 

8. A method according to Claim 7 characterised in that the at least one anomaly 
characterisation rule has a form that an anomaly is detected or otherwise by 
application of the rule according to whether or not a condition set of at least one 
condition associated with the rule is fulfilled. 

9. A method according to Claim 8 characterised in that the at least one anomaly 
characterisation rule is developed by refining a most general rule by at least one of: 

a) addition of a new condition to the condition set; and 

b) unification of different variables to become constants or structured terms. 

10. A method according to Claim 9 characterised in that a variable in the at least one 
anomaly characterisation rule which is defined as being in constant mode and is 
numerical is at least partly evaluated by providing a range of values for the variable, 
estimating an accuracy for each value and selecting a value having optimum 
accuracy. 

11. A method according to Claim 10 characterised in that the range of values is a first 
range with values which are relatively widely spaced, a single optimum accuracy value 
is obtained for the variable, and the method includes selecting a second and relatively 
narrowly spaced range of values in the optimum accuracy value's vicinity, estimating 
an accuracy for each value in the second range and selecting a value in the second 
range having optimum accuracy. 

12. A method according to Claim 11 characterised in that it includes filtering to remove 
rule duplicates and rule equivalents, i.e. any rule having like but differently ordered 
conditions compared to another rule, and any rule which has conditions which are 
symmetric compared to those of another rule. 

13. A method according to Claim 12 characterised in that it includes filtering to remove 
unnecessary 'less than or equal to' ("lteq") conditions. 

14. A method according to Claim 13 characterised in that the unnecessary "lteq" 
conditions are associated with at least one of ends of intervals, multiple lteq 
predicates and equality condition and lteq duplication. 
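
The three kinds of unnecessary "less than or equal to" condition may be interpreted as follows: a bound at the end of a variable's interval is always true, a looser bound is implied by a tighter one on the same variable, and a bound is implied by an equality that already satisfies it. The Python sketch below follows that interpretation, which is an assumption rather than the specification's own procedure; the condition encoding and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, str, float]  # (operator, variable, constant); operator is "lteq" or "eq"


def prune_lteq(conditions: List[Condition],
               upper_bounds: Dict[str, float]) -> List[Condition]:
    """Drop 'lteq' conditions that add nothing to the rule."""
    eq_values = {var: value for op, var, value in conditions if op == "eq"}
    tightest: Dict[str, float] = {}
    for op, var, value in conditions:
        if op == "lteq":
            tightest[var] = min(value, tightest.get(var, value))

    kept: List[Condition] = []
    emitted = set()
    for op, var, value in conditions:
        if op != "lteq":
            kept.append((op, var, value))
        elif var in eq_values and eq_values[var] <= value:
            continue  # equality already implies the bound
        elif value >= upper_bounds.get(var, float("inf")):
            continue  # bound sits at the end of the interval, so it is always true
        elif value > tightest[var] or var in emitted:
            continue  # looser or repeated bound on the same variable
        else:
            emitted.add(var)
            kept.append((op, var, value))
    return kept


conditions = [
    ("lteq", "duration", 60.0),
    ("lteq", "duration", 30.0),
    ("eq", "cost", 5.0),
    ("lteq", "cost", 9.0),
    ("lteq", "hour", 23.0),
]
pruned = prune_lteq(conditions, upper_bounds={"hour": 23.0})
```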

15. A method according to Claim 7 characterised in that it includes implementing an 
encoding length restriction to avoid overfitting noisy data by rejecting a rule refinement 
if the refinement encoding cost in number of bits exceeds a cost of encoding the 
positive examples covered by the refinement. 
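
The encoding length restriction compares two costs in bits. The Python sketch below is illustrative only and assumes a FOIL-style encoding, which may differ from the cost functions of the specification: a rule costs log2 of the number of candidate conditions per condition, and the covered positive examples cost log2(N) + log2(C(N, p)) bits, where N is the training set size and p the number of positives covered. All function and parameter names are hypothetical.

```python
import math


def example_bits(n_covered_pos: int, n_training: int) -> float:
    """Assumed cost of naming the covered positives explicitly."""
    return math.log2(n_training) + math.log2(math.comb(n_training, n_covered_pos))


def rule_bits(n_conditions: int, n_candidate_conditions: int) -> float:
    """Assumed cost of encoding a rule: one choice per condition."""
    return n_conditions * math.log2(n_candidate_conditions)


def accept_refinement(n_conditions: int, n_candidate_conditions: int,
                      n_covered_pos: int, n_training: int) -> bool:
    """Reject the refinement if encoding it costs more than encoding what it covers."""
    return rule_bits(n_conditions, n_candidate_conditions) <= example_bits(
        n_covered_pos, n_training
    )


# A three-condition rule covering 20 positives is cheap relative to its coverage.
keeps_general_rule = accept_refinement(3, 64, 20, 1000)
# A five-condition rule covering a single positive is likely fitting noise.
keeps_noisy_rule = accept_refinement(5, 64, 1, 1000)
```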

16. A method according to Claim 7 characterised in that it includes stopping construction 
of a rule if at least one of three stopping criteria is fulfilled as follows: 

a) the number of conditions in any rule in a beam of rules being processed is 
greater than or equal to a prearranged maximum rule length, 

b) no negative examples are covered by a most significant rule, which is a rule 
that: 

i) is present in a beam currently being or having been processed, 

ii) is significant, 

iii) has obtained a highest likelihood ratio statistic value found so far, and 

iv) has obtained an accuracy value greater than a most general rule 
accuracy value, and 

c) no refinements were produced which were eligible to enter the beam currently 
being processed in a most recent refinement processing step (32). 
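
The three stopping criteria can be combined in a single test, as in the following Python sketch. It assumes that the most significant rule (significant, highest likelihood ratio statistic so far, accuracy above that of the most general rule) has already been identified elsewhere; the class `RuleStats` and all parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleStats:
    n_conditions: int       # number of conditions in the rule
    negatives_covered: int  # negative examples the rule covers


def should_stop(beam: List[RuleStats],
                most_significant: Optional[RuleStats],
                max_rule_length: int,
                n_new_in_beam: int) -> bool:
    """Stop rule construction if any one of criteria a), b) or c) is fulfilled."""
    too_long = any(r.n_conditions >= max_rule_length for r in beam)                  # a)
    clean = most_significant is not None and most_significant.negatives_covered == 0  # b)
    stalled = n_new_in_beam == 0                                                     # c)
    return too_long or clean or stalled


beam = [RuleStats(2, 5), RuleStats(3, 1)]
```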

17. A method according to Claim 16 characterised in that it includes adding the most 
significant rule to a list of derived rules and removing positive examples covered by 
the most significant rule from the training data set. 
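
Adding a rule to the derived list and removing the positive examples it covers amounts to a covering loop. A minimal Python sketch follows, assuming rules are represented as predicates over examples; `toy_learner` is a hypothetical stand-in for the rule construction procedure and is not part of the specification.

```python
from typing import Callable, List, Optional, Tuple

Example = float
Rule = Callable[[Example], bool]


def covering(positives: List[Example],
             learn_rule: Callable[[List[Example]], Optional[Rule]]
             ) -> Tuple[List[Rule], List[Example]]:
    """Repeatedly learn a rule, record it, and drop the positives it covers."""
    remaining = list(positives)
    derived: List[Rule] = []
    while remaining:
        rule = learn_rule(remaining)
        if rule is None or not any(rule(e) for e in remaining):
            break  # nothing further can be learned
        derived.append(rule)
        remaining = [e for e in remaining if not rule(e)]
    return derived, remaining


def toy_learner(examples: List[Example]) -> Optional[Rule]:
    """Hypothetical learner: cover values within 2.0 of the first remaining example."""
    centre = examples[0]
    return lambda e: abs(e - centre) <= 2.0


rules, uncovered = covering([1.0, 2.0, 3.0, 10.0, 11.0], toy_learner)
```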

18. A method according to Claim 7 characterised in that it includes: 

a) selecting rules which have not met rule construction stopping criteria, 

b) selecting a subset of refinements of the selected rules associated with 
accuracy estimate scores higher than those of other refinements of the 
selected rules, and 

c) iterating a rule refinement, filtering and evaluation procedure (32 to 38) to 
identify any refined rule usable to test data. 
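
Step b) corresponds to keeping only the best-scoring refinements, as in a beam search. The Python sketch below assumes each refinement is paired with its accuracy estimate; the rule labels, scores and beam width are hypothetical.

```python
import heapq
from typing import List, Tuple

ScoredRule = Tuple[str, float]  # (rule description, estimated accuracy); hypothetical form


def select_beam(refinements: List[ScoredRule], beam_width: int) -> List[ScoredRule]:
    """Keep the beam_width refinements with the highest accuracy estimates."""
    return heapq.nlargest(beam_width, refinements, key=lambda item: item[1])


candidates = [("r1", 0.62), ("r2", 0.91), ("r3", 0.75), ("r4", 0.58)]
beam = select_beam(candidates, beam_width=2)
```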

19. Computer apparatus for anomaly detection characterised in that it is programmed to 
execute the steps of: 

a) developing a rule set of at least one anomaly characterisation rule from a 
training data set and any available relevant background knowledge using at 
least first order logic, a rule covering a proportion of positive anomaly 
examples of data in the training data set, and 

b) applying the rule set to test data for anomaly detection therein. 

20. Computer apparatus according to Claim 19 characterised in that it is programmed to 
develop the rule set using Higher-Order logic. 

21. Computer apparatus according to Claim 20 characterised in that it includes developing 
the rule set by: 

a) forming an alphabet having selector functions allowing properties of the 
training data set to be extracted, together with at least one of the following: 
additional concepts, background knowledge constant values and logical AND 
and OR functions, 

b) forming current rules from combinations of items in the alphabet such that type 
consistency and variable consistency is preserved, 

c) evaluating the current rules for adequacy of classification of the training data 
set, 

d) if no current rule adequately classifies the training data set, generating new 
rules by applying at least one genetic operator to the current rules, a genetic 
operator having one of the following functions: i) combining two rules to form a 
new rule, ii) modifying a single rule by deleting one of its conditions or adding a 
new condition to it, or iii) changing one of a rule's constant values for another 
of an appropriate type, and 

e) designating the new rules as the current rules and iterating steps c) onwards 
until a current rule adequately classifies the training data set or a 
predetermined number of iterations is reached. 
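
The three genetic operator functions in step d) can be illustrated with the following Python sketch. It is a simplified picture, not the claimed implementation: a rule is assumed to be a list of (attribute, constant) conditions, type consistency is approximated by drawing replacement constants only from those listed for the same attribute, and all attribute names and values are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Condition = Tuple[str, float]  # hypothetical form: (attribute, constant)
Rule = List[Condition]


def combine(a: Rule, b: Rule, rng: random.Random) -> Rule:
    """i) Combine two rules: a non-empty prefix of one plus a suffix of the other."""
    return a[: rng.randint(1, len(a))] + b[rng.randint(0, len(b)):]


def delete_condition(rule: Rule, rng: random.Random) -> Rule:
    """ii) Delete one condition (a single-condition rule is left unchanged)."""
    if len(rule) <= 1:
        return list(rule)
    i = rng.randrange(len(rule))
    return rule[:i] + rule[i + 1:]


def add_condition(rule: Rule, alphabet: List[Condition], rng: random.Random) -> Rule:
    """ii) Add one condition drawn from the alphabet."""
    return rule + [rng.choice(alphabet)]


def change_constant(rule: Rule, constants: Dict[str, List[float]],
                    rng: random.Random) -> Rule:
    """iii) Swap one condition's constant for another allowed for that attribute."""
    i = rng.randrange(len(rule))
    attribute, _ = rule[i]
    mutated = list(rule)
    mutated[i] = (attribute, rng.choice(constants[attribute]))
    return mutated


rng = random.Random(0)  # fixed seed so the sketch is repeatable
constants = {"duration": [30.0, 60.0, 120.0], "cost": [1.0, 5.0]}
alphabet = [(name, value) for name, values in constants.items() for value in values]
parent_a: Rule = [("duration", 60.0), ("cost", 5.0)]
parent_b: Rule = [("cost", 1.0)]

child = combine(parent_a, parent_b, rng)
shorter = delete_condition(parent_a, rng)
longer = add_condition(parent_a, alphabet, rng)
mutated = change_constant(parent_a, constants, rng)
```

Each operator returns a new rule and leaves its inputs untouched, so the current rules remain available for evaluation while new rules are generated.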

22. Computer apparatus according to Claim 19 characterised in that data samples in the 
training data set have characters indicating whether or not they are associated with 
anomalies. 

23. Computer apparatus according to Claim 19 characterised in that the at least one 
anomaly characterisation rule has a form that an anomaly is detected or otherwise by 
application of such rule according to whether or not a condition set of at least one 
condition associated with that rule is fulfilled. 

24. Computer apparatus according to Claim 19 characterised in that the at least one 
anomaly characterisation rule is developed by refining a most general rule by at least 
one of: 

a) addition of a new condition to the condition set; and 

b) unification of different variables to become constants or structured terms. 

25. Computer apparatus according to Claim 24 characterised in that a variable in the at 
least one anomaly characterisation rule which is defined as being in constant mode 
and is numerical is at least partly evaluated by providing a range of values for the 
variable, estimating an accuracy for each value and selecting a value having optimum 
accuracy. 

26. Computer apparatus according to Claim 23 characterised in that it is programmed to 
filter out at least one of rule duplicates, rule equivalents and unnecessary 'less than or 
equal to' ("lteq") conditions. 

27. Computer apparatus according to Claim 23 characterised in that it is programmed to 
stop construction of a rule if at least one of three stopping criteria is fulfilled as follows: 

d) the number of conditions in any rule in a beam of rules being processed is 
greater than or equal to a prearranged maximum rule length, 

e) no negative examples are covered by a most significant rule, which is a rule 
that: 

i) is present in a beam currently being or having been processed, 

ii) is significant, 

iii) has obtained a highest likelihood ratio statistic value found so far, and 

iv) has obtained an accuracy value greater than a most general rule 
accuracy value, and 

f) no refinements were produced which were eligible to enter the beam currently 
being processed in a most recent refinement processing step. 

28. Computer software for use in anomaly detection characterised in that it incorporates 
instructions for controlling computer apparatus to execute the steps of: 

a) developing a rule set of at least one anomaly characterisation rule from a 
training data set and any available relevant background knowledge using at 
least first order logic, a rule covering a proportion of positive anomaly 
examples of data in the training data set, and 

b) applying the rule set to test data for anomaly detection therein. 

29. Computer software according to Claim 28 characterised in that it incorporates 
instructions for controlling computer apparatus to develop the rule set using Higher- 
Order logic. 

30. Computer software according to Claim 29 characterised in that it incorporates 
instructions for controlling computer apparatus to develop the rule set by: 

a) forming an alphabet having selector functions allowing properties of the 
training data set to be extracted, together with at least one of the following: 
additional concepts, background knowledge constant values and logical AND 
and OR functions, 

b) forming current rules from combinations of items in the alphabet such that type 
consistency and variable consistency is preserved, 

c) evaluating the current rules for adequacy of classification of the training data 
set, 

d) if no current rule adequately classifies the training data set, generating new 
rules by applying at least one genetic operator to the current rules, a genetic 
operator having one of the following functions: i) combining two rules to form a 
new rule, ii) modifying a single rule by deleting one of its conditions or adding a 
new condition to it, or iii) changing one of a rule's constant values for another 
of an appropriate type, and 

e) designating the new rules as the current rules and iterating steps c) onwards 
until a current rule adequately classifies the training data set or a 
predetermined number of iterations is reached. 

31. Computer software according to Claim 28 characterised in that data samples in the 
training data set have characters indicating whether or not they are associated with 
anomalies. 

32. Computer software according to Claim 28 characterised in that the at least one 
anomaly characterisation rule has a form that an anomaly is detected or otherwise by 
application of such rule according to whether or not a condition set of at least one 
condition associated with that rule is fulfilled. 

33. Computer software according to Claim 28 characterised in that it incorporates 
instructions for controlling computer apparatus to develop the at least one anomaly 
characterisation rule by refining a most general rule by at least one of: 

a) addition of a new condition to the condition set; and 

b) unification of different variables to become constants or structured terms. 

34. Computer software according to Claim 33 characterised in that it incorporates 
instructions for controlling computer apparatus to at least partly evaluate a variable in 
the at least one anomaly characterisation rule which is defined as being in constant 
mode and is numerical by providing a range of values for the variable, estimating an 
accuracy for each value and selecting a value having optimum accuracy. 

35. Computer software according to Claim 32 characterised in that it incorporates 
instructions for controlling computer apparatus to filter out at least one of rule 
duplicates, rule equivalents and unnecessary 'less than or equal to' ("lteq") conditions. 

36. Computer software according to Claim 32 characterised in that it incorporates 
instructions for controlling computer apparatus to stop construction of a rule if at least 
one of three stopping criteria is fulfilled as follows: 

g) the number of conditions in any rule in a beam of rules being processed is 
greater than or equal to a prearranged maximum rule length, 

h) no negative examples are covered by a most significant rule, which is a rule 
that: 

i) is present in a beam currently being or having been processed, 

ii) is significant, 

iii) has obtained a highest likelihood ratio statistic value found so far, and 

iv) has obtained an accuracy value greater than a most general rule 
accuracy value, and 

i) no refinements were produced which were eligible to enter the beam currently 
being processed in a most recent refinement processing step. 

ABSTRACT 

A method of anomaly detection applicable to telecommunications or retail fraud or software 
vulnerabilities uses inductive logic programming to develop anomaly characterisation rules 
from relevant background knowledge and a training data set, which includes positive 
anomaly samples of data covered by rules. Data samples include 1 or 0 indicating 
association or otherwise with anomalies. An anomaly is detected by a rule having a condition 
set which the anomaly fulfils. Rules are developed by addition of conditions and unification 
of variables, and are filtered to remove duplicates, equivalents, symmetric rules and 
unnecessary conditions. Overfitting of noisy data is avoided by an encoding cost criterion. 
Termination of rule construction involves criteria of rule length, absence of negative 
examples, rule significance and accuracy, and absence of recent refinement. Iteration of 
rule construction involves selecting rules with unterminated construction, selecting rule 
refinements associated with high accuracies, and iterating a rule refinement, filtering and 
evaluation procedure (32 to 38) to identify any refined rule usable to test data. Rule 
development may use first order logic or Higher Order logic. 
Figure 3 should accompany the Abstract 